Sources of raw data
The creation of analysis datasets, as described on this page, starts from the cleaned and aggregated raw data. The methods of collection of the original data are described in the main study pages (links top left of this page); there are additional pages describing data entry, data cleaning (of booklet and web data), and storage of data files.
Except for twin web test (and phone app) data, the cleaned and aggregated raw data for each study are stored in an Access database file. Typically, the Access database will contain separate tables for twin data, parent data, teacher data and admin data. Access is a convenient format because multiple tables can be stored within the same database file; because the database can incorporate data-entry forms and other useful programs in addition to the data; and because it can enforce various data integrity rules to ensure that the data remain clean when updated.
The files of twin web/app data (collected in web studies from age 10 onwards, via a phone app from age 21 onwards) are generally too large for convenient storage in Access. Each web data file has very many columns, and would have to be split into several database tables. Furthermore, the raw electronic data files are complete, in the sense that further data are not added manually, and there is no need for the data entry and data integrity facilities that Access provides. Instead, the aggregated data from each test/study is saved in a single SPSS data file. These SPSS files may be periodically updated during the course of the study, as more web data are collected; but once the study ends the SPSS files become "static", and do not need further updating in any way.
Some of the raw data used to make analysis datasets are not study-specific, but consist of general family background data. These data include twin sexes, birth dates, birth orders, zygosities, and medical conditions (used to classify medical exclusions). These types of data are maintained in the TEDS admin database; they are continually updated as a result of new information from families, zygosity test results, and so on. These admin data are stored in the TEDS_ADMIN SQL Server database, running on the TEDS database server. This database is quite large and has many tables; the significant tables in this context are Twins (containing twin IDs, birth orders, sexes and birth dates), TwinMedicalCondition (containing twin medical exclusion and autism categories) and DNAprogress (containing DNA zygosity tests).
Exporting raw data
SPSS, which is used to process the data, cannot easily open data directly from Access database tables. It is therefore necessary to "export" the raw data by copying them into a more suitable file format. The .csv file format is chosen because it is platform-independent, can be read by many different software programs, has no limit on the number of rows or columns of data, and is non-proprietory (and therefore not subject to format changes with different versions, etc).
Generally, the data from each Access database table is exported into a separate csv file. A csv file consists of plain text, which can be opened in text editors like Notepad, and can also be opened in grid format by Excel. The csv format means that variables are comma-separated in rows of data, and each case starts on a new line. The first line of the file contains the column headings (or variable names).
Data are exported indirectly from Access tables using saved "queries" within each Access database file. Each query selects appropriate columns from the relevant table, hence excluding unwanted columns such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a form that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and boolean true/false columns (changed to integer 1/0 values).
One way of exporting data from each Access query is to run the query, then use the Access external data commands to export the query as a text file; select csv (specify commas as variable delimiters), specify that column headings should be included in the first row, and enter the appropriate filename and folder for the exported file. In order to simplify this task, a macro has been created in each Access database; simply run this macro, and each query in turn is run and exported to the appropriate file.
The exported csv files for each study are saved in an \Export\ folder, as described in the data files page. The SPSS files of twin web test data are stored in the same place. The files in this folder are the ones opened directly by the SPSS scripts when the analysis datasets are created.
Background data from the TEDS admin database (SQL Server) can be exported in a similar way, if an Access front-end interface is used. The exported csv files from this source are stored in the \Export\ folder for the 1st Contact study.
Before a new version of the analysis dataset is made, by running the scripts described below, it is not always necessary to re-export the raw data. Some new dataset versions incorporate changes made to the syntax in the scripts, but not changes in the raw data. Exporting the data from Access is therefore only necessary if (1) a new version of the dataset is about to be created, and (2) either new raw data has been added, or the raw data have been updated by cleaning.
Processing by scripts
Creating one of the TEDS analysis datasets involves a fairly long sequence of operations, which are described under the headings below. These operations are carried out in statistical software (SPSS). The instructions for each operation, and the order in which the operations are performed, are encoded in proprietory syntax for SPSS; the syntax is saved in a set of "scripts" or syntax files. Each script is a plain text file, saved with a proprietory file extension (.sps for SPSS syntax).
Typically, for each dataset, the syntax is divided into several scripts, mainly in order to reduce the length of each script; each script generally contains a logically related set of functions, such as merging raw data sources, or creating scales; and this makes it easier to find a given piece of syntax, and to understand the operations being carried out. The scripts must be executed in a given order for each dataset, as the syntax operations follow each other in a logical sequence.
Each script contains not only syntax but comments. The comments give an explanation of the operations being carried out. The syntax and comments together form a crucial part of the documentation for each dataset, as they explain exactly how the dataset is constructed.
Different TEDS datasets have been created using slightly different syntax methods, sometimes with operations performed in different orders. However, they generally all involve the operations that are described under the headings below. Some examples of variations between the ways the datasets are constructed are:
- The 1st Contact and 8 Year studies involved only parent booklets/questionnaires and no twin data. Hence, any steps involving on merging using TwinID, and double entry of twin data, do not apply.
- The in home study has its own exclusion variable, based on information collected at the time of the visits. Hence, the usual across-age exclusion variables are not used.
- At ages 7, 9, 10, 12 and 14, the studies involved teacher questionnaires. Hence, additional steps are needed to merge and process the teacher data (treated in essentially the same was as twin data).
- At ages 10, 12, 14 and 16, the studies included twin web tests and (in some cases) parent web activities, in addition to the usual paper booklets/questionnaires. Hence, construction of these dataset involves additional files, typically one per web activity. The web activities were very varied and required different types of processing.
- In TEDS 21, parents and twins had a choice of completing the questionnaires via mobile app, on the web or on paper. Hence, there are separate files for the different sources of data which must be made compatible, merged together, then cleaned to remove duplication for some participants.
Importing raw data
Before further processing can be carried out, the raw data must be "imported" into SPSS, from the files saved in the \Export\ folder for the given study (see exporting data above). Most of these are csv files, although the twin web test data are already saved in SPSS files. Importing data generally means opening a file of a foreign type (e.g. csv) and copying its data into an open SPSS data file. In SPSS, the syntax for importing a csv file also allows the name and data type of each variable to be specified.
The operation of importing a raw data file is generally followed by sorting the rows in ascending order of subject ID, sometimes followed by variable naming and recoding, followed by saving the file in SPSS format to be merged at a later stage.
In SPSS, variable naming is often done simultaneously with importing the raw data, because the syntax for importing a csv file allows the variable names to be specified at the same time. In some dataset scripts, the renaming of variables (from raw variable names to final analysis dataset variable names) is done at a later stage.
The naming of variables in TEDS datasets follows certain conventions. Historically, variable names were restricted to 8 or fewer characters, because this was a requirement in older versions of SPSS and SAS. Variable names in more recent datasets are often longer. The first letter of each variable name generally denotes the study in which the data were collected: 'a' for 1st Contact, 'b' for 2 Year, and so on. The position in the alphabet of this initial letter denotes the approximate age of the twins when the data were collected; hence there are variable names starting with 'g' but not 'f', because there was a 7 Year study but no 6 Year study.
In some datasets, notably from the 7 year study onwards, a second letter is added to denote the respondent: 'p' for parent, 't' for teacher and 'c' for child (twin self-report). Additional characters may be used to denote different parts of the study, for example 'ex' to denote exam results at age 16, 'n' to denote the navigation study post-18, and '1' and '2' to denote phases 1 and 2 within the TEDS21 study.
Following these prefixed letters, many variable names have 2-4 letters denoting the measure (e.g. 'sdq') followed by a number to denote the item number within the measure. Scales may be suffixed with 't' (denoting a total) or 'm' (denoting a mean). However, there is no fixed pattern for all variables.
Another naming convention is used for twin-specific variables, to distinguish between the same data item for the two twins in a single row of a double entered dataset. The twin identified by ID in a given row of data is referred to as the "index" twin; all twin-specific variables for this twin in the dataset have names ending in '1'. The variables for the index twin's co-twin, in the same row of data, have names ending in '2'. As a result, each twin's data is effectively duplicated within a double entered dataset: the data appear once in the index twin's variables for the given twin, and once in the co-twin's variables for the other twin.
To make variables easier to use, they are formatted in SPSS by modifying properties such as the "width", "decimals", "columns" and "measure".
Firstly, by using the SPSS FORMATS syntax command, or by specifying the format in syntax for importing a text file, each variable's "width" and "decimals" properties are set. For example, the SPSS format F1.0 would be used for a single-digit integer variable (width=1, decimals=0), and format F5.2 might be used for a decimal variable such as a twin's height (width=5, decimals=2). Setting the format does not alter the values stored in SPSS; it merely alters the default display of variable values in SPSS.
Secondly, by using the SPSS VARIABLE LEVEL syntax command, each variable can be described (in the SPSS "measure" property) as nominal, ordinal or scale. Conventionally in the TEDS datasets, integer-valued categorical variables are set to nominal or ordinal as appropriate, and this property makes it easier in SPSS to use the variables in analyses such as tabulations. Other variables set to measure=scale include quantitative measurements such as weights or ages, scales, composites and decimal-valued variables generally. String variables are nominal by default.
Further variable properties may also be set in SPSS syntax. These properties include the "columns" (set with the SPSS VARIABLE WIDTH syntax command), which determines the width of the column displayed in the SPSS data view. The default width is often appropriate, but may be changed when it is inconveniently wide or narrow. Another variable property that may be set is the "missing" property. For a variable containing a value such as 2=don't know or 9=not applicable, such a value can be set as "missing" in SPSS (using the MISSING VALUES syntax command); the result is that the value is conveniently treated as missing in any SPSS analysis, including computation of scales.
In the raw data, missing values are generally denoted by the code value -99, and not-applicable values are denoted by the code value -77. These codes are not used in the analysis datasets, so these values are recoded to missing for all item variables (in SPSS, a system-missing or 'sysmis' value is used).
Often, the value coding used in the raw data differs from that used in the analysis datasets. For example, simple value codes of 1,2,3,... are often used in data entry for the raw data, but in the analysis dataset it might be more convenient to used value codes of 0,1,2,... (often to aid the computation of scales). Reverse-coding is sometimes necessary, to ensure that all related item variables are coded in a consistent way. The scripts therefore contain syntax for changing the value codes where necessary.
Occasionally, where booklet data have not already been fully cleaned in the raw state, there are attempts in the scripts to identify and recode invalid or compromised data values. Where identified, invalid values are generally recoded to missing, to prevent them being used in analysis. Examples of invalid values might include extreme or out-of-range quantitative measurements, such as ages or heights. Examples of compromised data might include inconsistent responses in multi-part questions.
Recoding and cleaning web/app data
The data collected in web studies (from age 10 onwards) and app studies (from age 21 onwards) are recoded in order to replace missing values, to identify item events such as timeouts, discontinues and branching, and to identify anomalies such as item crashes.
Furthermore, attempts have been made to identify instances of tests that were compromised by loss of data, malfunctioning of test rules, or random responding by twins. Such instances are flagged using test status variables, and are effectively excluded by setting the test variable values to missing.
The processes involved in recoding and cleaning the web data are described in more detail in a separate page.
Double entering parent/family data
In parent-reported questionnaire data, and in some admin data, there are items referring specifically to the elder and younger twins, and these occur in the same row of data for each family. Double entering these data consists of two main operations in the script: firstly, the data must be duplicated so that there is one row per twin, not just one row per family; secondly, the twin-specific variables must be "swapped over" in the duplicated data so that they correctly match the twin and co-twin for the relevant row of data. The script usually follows the following stages:
- Start with the parent- and/or family-specific data, as imported from the raw data files. This is often done after merging all the family-specific data together (using FamilyID as the key variable), but before merging the twin/teacher data.
- This is usually the point at which the random variable is added (see also below).
- Within the family-specific data, if not already done, rename twin-specific items so that "elder twin" item variables have names ending in '1', and "younger twin" item variables have names ending in '2'.
- Compute the twin identifier Atempid2 by appending the digit 1 to the end of the FamilyID value.
- At this stage, the dataset represents the elder twins; each variable with name ending in '1' represents the elder twin as identified by Atempid2, while each variable with name ending in '2' represents the younger twin, i.e. the co-twin. Save this as the "top" half of the dataset.
- Now make an equivalent dataset for the younger twins. Starting by re-computing the twin identifier Atempid2, so it ends in 2 instead of 1.
- Also, for the second twin, reverse the value of the random variable (0 to 1, or 1 to 0) that has been derived for the elder twin.
- Now swap over the twin-specific variables: the variables having names ending in '1' must be swapped over with the variables having names ending in '2'. This is conviently done by renaming the variables: for example, to swap variables var1 and var2, rename var1 to var1x, then rename var2 to var1, then rename var1x to var2.
- At this stage, the dataset represents the younger twins; each variable with name ending in '1' represents the younger twin as identified by Atempid2, while each variable with name ending in '2' represents the elder twin, i.e. the co-twin. Save this as the "bottom" half of the dataset.
- Combine the cases from the "top" and "bottom" datasets saved above, into a single dataset.
- Re-sort in ascending order of Atempid2.
- The family-specific data is now double entered. There is one row per twin, each twin being identified uniquely by Atempid2. The variables having names ending in '1' contain data for this twin (whether elder or younger). The variables having names ending in '2' contain data for this twin's co-twin.
Creating the random variable
The variable named random has values 0 and 1. For a given twin, the value 0 or 1 is assigned randomly in this variable; however, co-twins must have opposite values. So if the elder twin has random=1, then the younger twin must have random=0, and vice versa. During analysis, if the dataset is filtered or selected on either random=1 or random=0, then exactly half the twins will be selected, but only one twin per pair is selected; and the selection will include an approximately 50:50 mix of elder and younger twins.
Computing this variable cannot be done directly in a dataset that already has one row per twin, because twin and co-twin values of random are not independent in different rows of the dataset. The random variable is often added to the dataset during double entry of the parent/family data (see above), because both processes involve combining "top" and "bottom" halves of the dataset. The processes involved are as follows:
- Start with a dataset containing the elder twin cases only. Elder twins are typically identified by variable Atempid2, with 1 as the final digit.
- Add the random variable, assigning the value 0 or 1 randomly to each elder twin.
- Save this as the "top" part of the dataset, representing the elder twins.
- Re-compute the twin identifier Atempid2, by changing the final digit from 1 to 2. This now identifies the younger twin cases.
- Modify the random variable, by recoding 1 to 0 and 0 to 1.
- Save this as the "bottom" part of the dataset, representing the younger twins.
- Combine the cases from the "top" and "bottom" datasets saved above, into a single dataset.
In the second step above, the random variable can be computed in the syntax using an appropriate SPSS random number generator function. In SPSS, a suitable command is COMPUTE random = RV.BERNOULLI(0.5). which will generate approximately equal numbers of 0 and 1 values randomly. Alternatively, a ready-made set of random values (one per family, matched against FamilyID) is available in a raw data file for importing and merging. The latter method was used historically in older versions of the datasets, whereas recent versions use a newly-computed value of random each time the dataset is made.
Merging data sources
In a given dataset, data typically come from several sources, e.g. from parent, twins and teachers, and from different booklets, questionnaires, web tests and admin sources. Separate raw data files are usually used to store the data from different sources; each raw data file is imported and saved as an SPSS data file (see importing raw data above). These files must be combined together, or "merged", into a single dataset. This must be done in such a way that all data relating to a given twin appears in the same row of the dataset; to do this, the files must be merged using a twin identifier as the key variable.
Many background variables that are the same across all datasets are conveniently merged in from a reference dataset containing such variables. This minimised the need to duplicate steps such as importing raw data and deriving the background variables in every dataset. Background variables treated in this way are typically zygosities, exclusion variables. See the background variables page for more details.
Different sets of data may identify twins in different ways, so sometimes it is necessary to modify the IDs so that they match up for merging. In raw twin and teacher data, twins are usually identified by a variable called TwinID. In raw parent data, each case is usually identified by a family identifier called FamilyID. As explained above, after double-entering the parent/family data, twins are identified by a variable called Atempid2. In the most scripts, files are merged in this order:
- Merge all per-family data (parent and admin data) on FamilyID.
- Double enter the per-family data, resulting in a file where twins are identified by Atempid2.
- Merge all per-twin data (twin and teacher data) on TwinID.
- Replace TwinID with Atempid2 in the per-twin data.
- Merge the per-family data with the per-twin data on Atempid2.
The means of replacing TwinID with Atempid2 is provided by the raw data file (exported from the TEDS admin database) containing the TwinID, FamilyID and twin birth order for every TEDS twin. After importing this file, Atempid2 can be computing by appending the twin birth order (1 or 2) to the end of the FamilyID value. This file can then be merged with the per-twin data (on TwinID), hence providing the Atempid2 variable needed for merging with the per-family data.
Before two datasets are merged, they must both be sorted in ascending order of the relevant key variable. The files must therefore be re-sorted if a new key variable is about to be used for the next merge. In the scripts, data files are usually sorted prior to saving, in steps preceding each merge.
Before merging, each set of data from a particular source is given a new flag variable to show the presence of data from that source. Sometimes these flag variables are already present in the imported raw data; if not, they are computed after importing the data into SPSS. These flag variables can be used later in the scripts to identify any cases without data from any source in the study; such cases can then be deleted.
The family and twin identifiers used in the raw data are the same as those used in the TEDS administrative system and in all contacts with the families: these are variables FamilyID, TwinID and the related Atempid2. These IDs are easily matched against identifiable individuals, and should never be used in analysis datasets. In order to protect the identities of twins and their families, and to maintain confidentiality in the data, alternative "scrambled" IDs are used in the analysis datasets.
The scripts therefore contain syntax to replace the original IDs (FamilyID, TwinID, Atempid2) with pseudonymised IDs (id_fam, id_twin). The syntax used for this is described in detail in the scrambled IDs page. After computing id_fam and id_twin, the original IDs are dropped from the dataset.
During construction of each dataset, the scrambling process is conveniently done after all the data have been merged together. This means that the scrambling syntax (which is quite long) is only needed once in the script.
The pseudonymous ID variables, id_fam and id_twin, have the same values for the same twins across all TEDS datasets, allowing these dataset to be merged together for the purpose of longitudinal analysis.
In theory twins are still identifiable from the identifier id_twin values (although the scrambling script and algorithm are not disclosed to researchers, in order to guard against this possibility). This means that datasets containing variables id_fam and id_twin are pseudonymous, not truly anonymous. For this reason, as a strict means of protecting confidentiality, these IDs are not released in datasets to researchers in most cases. The exception to this rule is for analysis within KCL of genotypic data, where it is necessar to link the genotypic dataset with the phenotypic dataset.
In datasets released to researchers and collaborators, a further encryption of IDs is carried out to convert them randomly and irreversibly to truly anonymous IDs. This is done after the merging of longitudinal data. This encryption, however, does not form part of the processing used to create the main TEDS datasets, as described on this page.
All variables to be retained in the analysis dataset are labelled, to help researchers identify what each variable represents, and to give information about value codes. While there is no rigid convention for labelling variables, where possible the following information is included in the variable label:
- The study or approximate twin age
- The source of data (e.g. parent, twin or teacher, booklet or web)
- The name of the measure
- For item variables, the item number where appropriate
- For questionnaire items, a truncated or simplified version of the text of the question
- Where appropriate, the nature of the item, e.g. response, score or response time for a web item.
- Some indication of the coding, range of values, or units of measurement
Value labels are also added to categorical integer-valued variables, where there are three or more categories present. The use of value labels removes the need to specify all the values within the text of the variable label.
Creating scales and composites
Scales, composites and other derived variables are generally added to the dataset after the initial stages of merging the data, and before labelling all the variables. Various methods are used to compute scales, but most involve calculation of the mean or sum of a set of related item variables. Sometimes, some of the item variables must be reversed-coded first (if this was not already done earlier in the scripts), so that all the relevant items are coded in a consistent way. Sometimes, the item variables must be transformed in other ways before computing scales: for example, some scales used standardised items; or sometimes, neutral responses like "don't know" must be removed before scaling.
For per-twin data (from teachers and twins), it is helpful to create scales before double entry, so that each scale variable only needs to be computed and labelled once. The new scale variables are then effectively copied and matched to the co-twins during double entry.
For per-family data (typically from parents), scales computed from twin-specific items will be correctly double entered if the item variables were previously double entered (see double entering parent/family data above). If this has not yet been done when the scales are computed, then the scales will subsequently have to be double entered along with the items.
Some derived variables added may be referred to as "composites" rather than scales. Usually, the term "composite" variable implies that data from two or more different measures have been combined together, whereas a "scale" variable generally is computed from items of the same measure. However, the distinction is not strict.
Creating background variables
Many of the background variables are not derived in each separate dataset, but instead are merged in from a reference dataset as described above. This reduces the need to duplicate the same derivations in every dataset, hence shortening the processing and reducing the risk of errors. The background variables merged in this way generally include zygosities and exclusion variables. See the background variables page for details. See also the exclusions page.
However, some background variables must be derived separately for each dataset. These include twin ages, which are specific to the data collection, and sometimes study-specific exclusions for example.
Within each dataset, there are typically different twin age variables for different data collections. For example, within a given study, the data from parent booklets, teacher questionnaires, twin web tests and twin booklets may all have been returned on quite different dates. The two twins in a pair may often return data on different dates, hence their age variables are double entered where appropriate.
Double entering twin data
Raw data from twin-specific sources, such as teacher questionnaires, twin tests and twin booklets, will contain one row of data per twin. Every data item relates to this twin. To double enter these data, the corresponding co-twin data items must be added to the same row of data for every twin. The data items must therefore be duplicated and matched to the co-twins.
Note that double entering the per-twin data, as described here, involves different processing from that involved in double entering the per-family data, as described above. Note also that double entering the per-family data is usually done at any early stage in the scripts, before the per-family data are merged with the per-twin data. However, double entering the per-twin data is more conveniently done later, once all the data sources have been merged together, after items have been labelled, and after scales have been computed.
In most cases, this double entry procedure comes after the script that scrambles the IDs. So the description below assumes that the twins are identified by the variable id_twin. The corresponding family identifier is id_fam. The value of id_twin comprises the twin birth order (1 or 2) appended to the value of id_fam. The same double entry procedure can be used with unscrambled IDs Atempid2 and FamilyID if necessary.
The exact method used for double entry varies between scripts, but the typical procedure is as follows:
- Rename all item variables from twin-specific sources, so that the variable names end in '1'.
- Sort the dataset is ascending order of id_twin.
- At this stage, the dataset represents the data for the "index" twins, i.e. the data relating to the twin identified by id_twin in each row of data. Save this as the "left hand" part of the dataset.
- Now make an equivalent dataset for the co-twins. Start by dropping any per-family variables that do not require double entry, and also dropping variables that have previously been double entered (from the parent data).
- Rename all the item variables so that the variable names end in '2' instead of '1'.
- Re-compute id_twin, to match the co-twin's id_twin value in each case: if the final digit of id_twin is 2, change it to 1, and vice versa.
- Sort the dataset in ascending order of the re-computed id_twin.
- At this stage, the dataset represents the data for the co-twins. Save this as the "right hand" part of the dataset.
- Merge the left and right hand parts, as saved above, using id_twin as the key variable. The dataset is now double entered.
After double entry, it may be possible to reduce the size of the dataset by deleting any twin pairs that do not have any data in the study. This may be done by selecting on the flag variables that show the presence of data from each source. To maintain paired twins in the dataset, and to avoid unpaired "singletons" in the dataset, selection should be based on the double-entered flag variables for twin and co-twin for each part of the data.