TEDS Data Dictionary

TEDS Background Variables

Contents of this page:

Introduction
Summary list of common background variables
Twin sexes
Zygosity
The random variable
Twin ages
LLC ages and dates
School cohort

Introduction

This page lists and describes "background variables" that are often or always included in datasets shared with researchers, and which are routinely needed in many analyses. These variables include general descriptors of twins and their families (for example twin sex and zygosity, family SES, cohort, twin age), exclusion variables, identifiers and the 'random' variable.

Some of these background variables derive from participant details that are stored and maintained in the TEDS admin database, where they are routinely updated through regular contact with participants. Such admin-based details include twin sexes, twin birth dates and data return dates (used to compute ages), results of zygosity tests, and the IDs used to identify twins and families in the raw data.

Other background variables derive from early data collections. For example, some exclusion variables originate from the 1st Contact dataset, and the best estimate of zygosity makes use of data from 1st Contact, 3 Year and 4 Year booklets.

Some of the background variables are described in more detail on other pages, to which links are provided.

Summary list of common background variables

The table below lists background variables that are often included in datasets distributed to researchers for analysis.

Category	Variable name(s)	Brief description	Link for further details
IDs	randomtwinid, randomfamid	Anonymised IDs used in most research datasets.	Go to the scrambled IDs page
	id_twin, id_fam	Pseudonymised IDs used where datasets must be linked together, typically for linking phenotypic and genotypic datasets.
	twin	Birth order (1=elder, 2=younger) of twin within a pair.
	STUDY_ID	Unique, pseudonymised twin identifier used in datasets within the LLC TRE (but not included in other datasets)
Twin sexes	sex1/2	Double entered variables, coded 0=female, 1=male (or missing if unknown)	See twin sexes below
Zygosity	zygos	Twin-pair zygosity, coded 1=MZ, 2=DZ (or missing if unknown)	See zygosity below
	sexzyg	Twin-pair sex-and-zygosity, coded 1=MZ male, 2=DZ male, 3=MZ female, 4=DZ female, 5=DZ opposite-sexes, 7=unknown
	x3zygos	Three-value twin-pair sex-and-zygosity, coded 1=MZ, 2=DZ same-sex, 3=DZ opposite-sexes (or missing if unknown)
Exclusions	exclude1/2	Double entered twin variables, coded 0=not excluded, 1=excluded, encapsulating the four categories of exclusion routinely used in TEDS analysis (medical exclusion, unknown sex or zygosity, perinatal outlier, absence of 1st Contact data).	Go to the exclusions page
	medexcl1/2	Double entered twin variables, coded 0=not excluded, 1=excluded, flagging a medical exclusion.
	acontact	Presence of 1st Contact data for the family (twin pair), coded 1=yes, 0=no.
	aperinat	Twin pair is categorised as a perinatal outlier, hence excluded, coded 1=yes, 0=no (or missing if 1st Contact data is absent). This is a derived variable in the 1st Contact dataset: see aperinat for details of derivation.
Random variable	random	Variable used to select one twin randomly from each twin pair in the dataset, selecting roughly equal numbers of elder and younger twins. Commonly used as a filter in twin data analysis, with values 0 and 1 (select arbitrarily on either value).	See the random variable below
Twin ages	[Variables are specific to each data collection]	Age of twin(s), in years, recorded as a decimal number. May be a single variable for the twin pair, for example if the data were reported by the parent, or double entered variables for individual twins.	See twin ages below
School cohort	cohort	Birth cohort related to attendance in different UK school years. Coded from birth dates as 1=Jan-94 to Aug-94, 2=Sep-94 to Aug-95, 3=Sep-95 to Aug-96, 4=Sep-96 to Dec-96.	See school cohort below
SES	ases	SES composite variable for the family (twin pair), derived from 1st Contact data. It is a composite from five variables: mother and father employment levels, mother and father educational levels, and mother's age on birth of first child. This is the most commonly used SES variable in TEDS datasets, but others are available at later ages (see the measures page).	See 1st Contact derived variables page for ases.
Ethnic origin	aethnic	Ethnic origin of the twin pair, as reported by parents in the 1st Contact booklet, coded 1=white, 0=other.	See variables in the 1st Contact study
Genotyping	genotyped1/2	Double entered twin variables flagging the presence (1=yes) or absence (0=no) of genotypic data for each twin. This relates to the harmonised genotypic dataset of Affymetrix and OEE data for 10,346 individual twins.	See the DNA and genotyping page and the Polygenic scores page.
	DZtwinpair	Flag variable to indicate whether both twins have been genotyped (1=yes, 0=no). Note that some but not all DZ pairs were genotyped; the genetic dataset also contains some unpaired twins from DZ pairs.
	selectunpaired	Flag variable to select only unpaired (unrelated) genotyped twins, including all genotyped singletons and one randomly selected twin from each genotyped DZ pair. Coded 1=yes, 0=no. Filter on value 1 to select the set of 7,026 unrelated twins.
	genotypedzyg	Categorises each twin pair according to zygosity and the number of twins genotyped. Values 0 to 4, coded 0=neither twin genotyped; 1=DZ, both genotyped; 2=DZ, one genotyped; 3=MZ, one genotyped; 4=unknown zygosity, one genotyped
	MZtwinPGScopied1/2	Double-entered twin flag variable (1=yes, 0=no). This identifies each MZ twin, not individually genotyped but with a genotyped cotwin, for whom the polygenic scores have been copied from the genotyped cotwin.
	chiptype	Genetic covariate, showing the type of chip on which each twin was genotyped. Coded 1=Affymetrix, 2=OEE.
	batchnumber	Genetic covariate, giving the batch in which each twin was genotyped. Values 0 to 7, coded 0=Affymetrix (assumed single batch); 1-6 for the main OEE batches; and 7 for the final batch of DZ cotwins.
	PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8, PC9, PC10	Genetic covariates, being the first 10 principal components extracted from the harmonised Affymetrix + OEE twin genetic dataset. Each of these variables has continuously-variable decimal values with mean zero.
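
As a minimal sketch of how the genotyping flags in the table might be used, the SPSS syntax below restricts a single procedure to the unrelated genotyped twins. It assumes the dataset in use contains selectunpaired, chiptype and batchnumber, coded as described above; the choice of FREQUENCIES is purely illustrative.

```spss
* Illustrative only: restrict the next procedure to unrelated genotyped twins.
* TEMPORARY means the selection applies to the following procedure only.
TEMPORARY.
SELECT IF (selectunpaired = 1).
FREQUENCIES VARIABLES = chiptype batchnumber.
```

Because the selection is temporary, the full dataset remains available for subsequent commands.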

Twin sexes

Parents were first asked to report twin sexes in the 1st Contact booklet. This became the initial basis for the TEDS admin record of twin sexes, maintained in the TEDS admin database. The 1st Contact record of twin sexes has subsequently been found to be incorrect for a small but significant minority of twins, either because parents made mistakes in recording responses or because of errors in data entry. Subsequently, through repeated contacts with the families, the twin sexes in the TEDS admin system have been checked and corrected where necessary.

The twin sex variables in the TEDS datasets, sex1/2, are therefore taken from the admin database and not from the raw 1st Contact data. Corrections to twin sexes have been very rare in recent years; however, it is possible that occasional discrepancies occurred in earlier versions of the TEDS datasets.

More recently, some twins have reported that they have changed their gender, often with changes of forename, with requests to be addressed as male rather than female or vice versa. In such cases, the reported change of gender is recorded in the admin database separately from the record of biological gender at birth. While the changed gender is used, where appropriate, in communications with the twins, the biological gender (at birth) continues to be used as the basis of the sex1/2 variables.

Zygosity

The variable zygos is included in nearly all TEDS datasets, along with the associated twin-pair sex-zygosity variables sexzyg and x3zygos. The meaning and coding of these variables are summarised in the table above. sexzyg and x3zygos are derived directly from zygos (and from twin sexes), so this section will focus on the derivation of zygos.

zygos is the best available estimate of twin-pair zygosity, based on the evidence currently available. This evidence is:

Twin sexes (if opposite sexes, the pair must be DZ)
DNA zygosity tests
The estimate from the zygosity algorithm, based on the zygosity questionnaire. This questionnaire has been administered to parents of twins in these three booklets:
1. 4 Year booklet study
2. 3 Year booklet study
3. 1st Contact booklet study

The list above also defines the order of precedence for deriving a best estimate of zygosity. Hence, if different estimates are available and they disagree, then precedence is given to the DZ estimate if the twins have opposite sexes; if the twins have the same sex, then precedence is given to the DNA test result, if available; then to the 4 Year booklet result, if available; then to the 3 Year result; and finally to the 1st Contact result.

The sources of the variables providing the evidence for zygosity are the TEDS admin database (where twin sexes and DNA zygosity test results are maintained) and the datasets for the 1st Contact, 3 Year and 4 Year booklets (where the zygosity algorithm results are derived).

DNA test results are more reliable than the booklet algorithm results, but DNA results are only available for a minority of twin pairs. The DNA zygosity test was offered to parents as a reward or incentive for returning twin DNA samples, but only if their twins were a same-sex pair. For various reasons, not all parents were asked for DNA samples, not all parents returned them when asked, not all parents requested a DNA zygosity test when they returned the samples, and not all DNA tests gave a conclusive result (see the DNA study page for more details). In cases where parents returned twin DNA samples but did not request a zygosity test, for reasons of cost, TEDS did not routinely carry out DNA zygosity tests except in those cases where the booklet algorithms had been inconclusive or contradictory. Hence, DNA zygosity test results are generally only available for (a) same-sex pairs where parents returned DNA and asked for a test result; and (b) same-sex pairs where parents returned DNA and did not ask for a test result but booklet algorithm results were unavailable or inconclusive.

Similarly, a booklet algorithm result may be unavailable for parents who did not return the respective booklet, or who returned the booklet but failed to complete the zygosity questionnaire, or where the responses in the questionnaire gave a result that was indeterminate (either because of inconsistent responses or because of a mid-range zygosity score) - see the zygosity algorithm page for full details.

For these reasons, the zygosity is still unknown for some twin pairs, and the value of zygos may be missing.

The SPSS syntax below shows how zygos, the best estimate of zygosity, is derived. It is derived from twin sexes (sex1/2, see above), DNA zygosity test results (dnazygos, from the TEDS admin database), and booklet zygosity test results (aalg2zy, calg2zy, dalg2zy, from the 1st Contact, 3 Year and 4 Year datasets respectively).

* Make best-estimate zygosity variable zygos.
* We can do this from scratch using the zygosity algorithm results
* (ages 1, 3 and 4), DNA zygosity results, and twin sexes.
* Start with the 1st Contact zygosity algorithm result.
* Retain coding 1=MZ, 2=DZ but discard values 5 (mid-range indeterminate) and 99 (inconsistent).
RECODE aalg2zy (1=1) (2=2) (ELSE=SYSMIS)
INTO zygos.
EXECUTE.
* If the 3 year algorithm result was conclusive (MZ or DZ)
* then this takes precedence over the 1st Contact result.
IF (ANY(calg2zy,1,2)) zygos = calg2zy.
EXECUTE.
* Similarly the 4 year algorithm result (if conclusive) takes precedence over the 3 year.
IF (ANY(dalg2zy,1,2)) zygos = dalg2zy.
EXECUTE.
* A DNA zygosity test result takes precedence over any algorithm result.
IF (ANY(dnazygos,1,2)) zygos = dnazygos.
EXECUTE.
* Over-rule all of the above if twins have opposite sexes.
IF (sex1 ~= sex2) zygos = 2.
EXECUTE.
* Finally, set to missing if sex of either twin is unknown,
* on the assumption that any zygosity test result cannot be reliable in such cases.
DO IF (SYSMIS(sex1) | SYSMIS(sex2)).
  RECODE zygos (ELSE=SYSMIS).
END IF.
EXECUTE.
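
To see which piece of evidence determined each pair's value of zygos, one could record the source alongside it. The sketch below mirrors the order of precedence in the syntax above. The variable zygsource and its coding are hypothetical, introduced here for illustration only; they are not part of the TEDS datasets.

```spss
* Sketch only: zygsource is a hypothetical variable, not a TEDS variable.
* Later statements over-write earlier ones, matching the precedence used for zygos.
COMPUTE zygsource = 0.
IF (ANY(aalg2zy,1,2)) zygsource = 1.
IF (ANY(calg2zy,1,2)) zygsource = 2.
IF (ANY(dalg2zy,1,2)) zygsource = 3.
IF (ANY(dnazygos,1,2)) zygsource = 4.
IF (sex1 ~= sex2) zygsource = 5.
IF (SYSMIS(sex1) | SYSMIS(sex2)) zygsource = 0.
VALUE LABELS zygsource
  0 'No source (zygos missing)' 1 '1st Contact algorithm' 2 '3 Year algorithm'
  3 '4 Year algorithm' 4 'DNA test' 5 'Opposite sexes'.
FREQUENCIES VARIABLES = zygsource.
```

Under these assumptions, zygsource=0 should coincide with a missing value of zygos.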

The syntax below shows how the twin-pair sex-zygosity variables are subsequently derived from zygos and from twin sexes sex1/2.

Variable x3zygos has three values: 1=MZ, 2=DZ same-sex and 3=DZ opposite-sexes (or missing if unknown).

Variable sexzyg has 6 values: 1=MZ male, 2=DZ male, 3=MZ female, 4=DZ female, 5=DZ opposite-sexes, 7=unknown.

* Create sex and zygosity variable sexzyg (five known categories plus 7=unknown).
IF(sex1 = 1 & sex2 = 1 & zygos = 1) sexzyg = 1.
IF(sex1 = 1 & sex2 = 1 & zygos = 2) sexzyg = 2.
IF(sex1 = 0 & sex2 = 0 & zygos = 1) sexzyg = 3.
IF(sex1 = 0 & sex2 = 0 & zygos = 2) sexzyg = 4.
IF((sex1 = 0 & sex2 = 1) | (sex1 = 1 & sex2 = 0)) sexzyg = 5.
IF(SYSMIS(zygos) | SYSMIS(sex1) | SYSMIS(sex2)) sexzyg = 7.
EXECUTE.

* Create a 3-value zygosity variable.
RECODE
  sexzyg
  (1=1)  (2=2)  (3=1)  (4=2)  (5=3)  (7=SYSMIS)  INTO  x3zygos .
EXECUTE .
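
As an optional check on the two derivations above, value labels taken from the codings listed earlier can be attached and the two variables cross-tabulated. This is only a sketch, not part of the documented dataset construction; pairs with sexzyg=7 have a system-missing value of x3zygos and so will not appear in the table.

```spss
* Sketch only: label the codes described above, then cross-check the two variables.
VALUE LABELS sexzyg
  1 'MZ male' 2 'DZ male' 3 'MZ female' 4 'DZ female'
  5 'DZ opposite-sexes' 7 'unknown'
  /x3zygos 1 'MZ' 2 'DZ same-sex' 3 'DZ opposite-sexes'.
CROSSTABS /TABLES = sexzyg BY x3zygos.
```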

The random variable

The variable named random is included in nearly all TEDS datasets, and is used as a filter to select one twin from each pair. It has values 0 and 1, and either value can be used as a filter: the set of twins selected by random=0 are the co-twins of the set of twins selected by random=1. This is because, within each pair of twins, one twin has random=0 and the other twin has random=1.

The values 0 and 1 are randomly assigned within each twin pair. Hence, filtering on either value 0 or 1 will select some elder twins and some younger twins, in roughly equal proportions. Therefore, filtering with the random variable avoids any potential bias or confounding effects that could result from selecting only elder twins or only younger twins (using the twin variable).
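
A minimal sketch of this use of random as a filter is shown below. It assumes a double entered dataset containing both random and twin; the FREQUENCIES command is included only to illustrate that the selected twins include elder and younger twins in roughly equal numbers.

```spss
* Illustrative only: analyse one randomly selected twin per pair.
* FILTER BY keeps cases where the filter variable is non-zero, here random = 1.
FILTER BY random.
FREQUENCIES VARIABLES = twin.
FILTER OFF.
```

To select the co-twins instead, a selection on random = 0 could be used in the same way.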

In the construction of each of the main TEDS datasets, the random variable is initially computed at the stage when parent/family data have been merged and there is one row of data per family. The syntax used is as follows:

* Compute random variable assigning values 0 and 1 randomly with
* equal probability, using a Bernoulli distribution having p=0.5.
COMPUTE random = RV.BERNOULLI(0.5).
EXECUTE.

Subsequently, in construction of a double entered dataset, the rows of data in this family dataset are duplicated (one copy for each twin) and the cases are merged. At this point, the value of the random variable is reversed for the second copy, so that random has opposite values for the two twins in each pair:

* Reverse the value when making the copy of the family dataset for the 2nd twin.
RECODE random (0=1) (1=0).
EXECUTE.

Because random is derived in this way, using a randomised value, the values of random will generally differ between different TEDS datasets, or between different versions of the same dataset. The results of analysis, based on a random filter, may therefore differ slightly if repeated with a new version of a dataset.
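
One way to confirm that random takes opposite values within each pair is sketched below. It assumes a double entered dataset in which both twins of every family are present and in which randomfamid identifies the family; randomsum is a hypothetical variable name used only for this check.

```spss
* Sketch of a check: within each family, the two values of random should sum to 1.
* randomsum is a hypothetical variable name used only for this check.
AGGREGATE
  /OUTFILE = * MODE = ADDVARIABLES
  /BREAK = randomfamid
  /randomsum = SUM(random).
FREQUENCIES VARIABLES = randomsum.
```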

Twin ages

All TEDS datasets contain one or more twin age variables, indicating the age at which a specific data collection was made. Such age variables are measured as decimal numbers of years (except in datasets used within the LLC TRE: see below). The syntax below gives a typical example of calculation of twin age variables. The twin birth date is subtracted from the date of data collection, to give an integer number of days; this number is then divided by 365.25 to give a decimal number of years:

* gpbage is the twin pair age when the 7 Year parent booklet was returned,
* derived from the twin birth date (aonsdob) and the booklet return date (gpbdate).
COMPUTE gpbage = (DATEDIFF(gpbdate, aonsdob, "days")) / 365.25 .
EXECUTE.

* gciage1 (later double entered as gciage2) is the age of an individual twin
* when the 7 Year twin phone interview was carried out, derived from
* the twin birth date (aonsdob) and the interview date (gcidate1).
COMPUTE gciage1 = (DATEDIFF(gcidate1, aonsdob, "days")) / 365.25 .
EXECUTE.
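
The arithmetic can be illustrated with a small worked example. The sketch below builds a one-row dataset with invented dates (not TEDS data) and hypothetical variable names (caseid, exbirth, exreturn, exage), so it should be run separately from any real dataset because DATA LIST replaces the active dataset.

```spss
* Hypothetical worked example, using invented dates rather than TEDS data.
DATA LIST FREE / caseid.
BEGIN DATA
1
END DATA.
COMPUTE exbirth = DATE.DMY(15, 3, 1995).
COMPUTE exreturn = DATE.DMY(20, 5, 2002).
* 2623 days elapse between the two dates, giving an age of about 7.18 years.
COMPUTE exage = DATEDIFF(exreturn, exbirth, "days") / 365.25.
FORMATS exbirth exreturn (DATE11) exage (F5.2).
LIST.
```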

Twin birth dates, and specific event dates such as data collection dates, are not retained as variables in the datasets, because they would make the data potentially identifiable and would therefore be a risk to participant confidentiality.

The source of many of the date variables used in this calculation is the TEDS admin database; this is where twin birth dates are maintained and where events such as booklet return dates are logged. For some types of data, notably web data, the dates of participation may be recorded in the raw data alongside the phenotypic variables.

Many TEDS studies have included more than one data collection, sometimes administered independently on different dates, and therefore requiring different twin age variables. For example, in the 7 Year study, the parent booklet, twin phone interviews, and teacher questionnaires were collected independently on different dates. There are therefore twin age variables for return of the parent booklet (gpbage), return of each twin's teacher questionnaire (gtqage1/2) and administration of each twin's phone interview (gciage1/2). Refer to the derived variables pages for details of the age variables available in each study.

For data collected online, typically in web activities, twins may sometimes enter data over a period of many days, especially in a long battery. Some twins may start the activities but leave them unfinished; hence, in some cases, the start date but not the end date may be recorded. For this reason, the start date is generally used to derive the twin age.

In most questionnaires since age 7, we have recorded the date of return (receipt of the questionnaire) but there is no date recorded in the questionnaire itself; the return date is then used to derive the twin age. However, in the early booklets (up to age 4), multiple dates were recorded by parents within the booklets themselves. In these cases, to avoid a proliferation of different and confusing age variables, a best estimate of completion date is made from the combined dates, and this date is used to derive the age. At ages 2, 3 and 4, the datasets contain three age variables based on different sets of dates: a parent booklet age (e.g. bpbage), an age for completion of parent-administered twin activities (e.g. badage1/2) and an age for completion of parent-reported measures in the twin booklet (e.g. brepage1/2).

LLC ages and dates

The LLC NHS linkage project is described on another page.

There are customised age and date variables that are included in TEDS datasets used within the LLC TRE, together with linked NHS medical records. These replace the usual TEDS ages, used in datasets outside the LLC TRE, as described above.

The LLC require that a 'timestamp' or date variable is included for each TEDS data collection. The function of these variables is to enable sequencing of events in the medical records relative to the dates when various TEDS measures were recorded.

It is TEDS policy not to include precise date and time variables in TEDS datasets, in order to minimise risks of participant identification. Therefore, we have agreed with the LLC that the LLC dataset dates should take the form of the year and month, provided as a string with format 'yyyy-mm', for example '2023-06' representing any date in June 2023. The LLC date variables are named xxxLLCdate, where 'xxx' is the usual name prefix denoting the data collection, for example zmhLLCdate for the TEDS26 mental health questionnaire.

Consistent with reducing all dates to the year and month, and in agreement with the LLC, the TEDS LLC age variables are measured in integer numbers of months; for example, the value 306 represents an age of 25 years and 6 months. The LLC age variables are named xxxLLCage, where 'xxx' is the usual name prefix denoting the data collection, for example zmhLLCage for the TEDS26 mental health questionnaire.

Linked NHS medical records datasets within the LLC TRE are expected to contain a variable denoting year and month of each twin's birth, although at the time of writing it is not known what form this variable will take. The TEDS LLC date and age variables, as described above, should be consistent with this birth date variable. As each birth date and TEDS event date is approximated to the month, omitting the day, the TEDS LLC ages should be accurate to within 2 months for each data collection.

The general derivation of the TEDS LLC date and age variables is shown in SPSS syntax below. Here, the name prefix 'xxx' can be substituted for any of the standard TEDS variable name prefixes for the data collections. Specific examples can be found in the derived variables pages for each TEDS study, in this data dictionary. The variable aonsdob is the twin birth date, which is not retained in TEDS researcher datasets.

* LLC date and age variables.
* Use these only as file 2 variables for the LLC TRE.
* First extract year and month as temp variables, from birth date and activity dates.
COMPUTE xxxyear = XDATE.YEAR(xxxdate).
COMPUTE birthyear = XDATE.YEAR(aonsdob).
COMPUTE xxxmonth = XDATE.MONTH(xxxdate).
COMPUTE birthmonth = XDATE.MONTH(aonsdob).
EXECUTE.

* The agreed date format is a string yyyy-mm,
* adding '0' where necessary for two-digit months.
STRING xxxLLCdate (A7).
IF (xxxmonth < 10) xxxLLCdate = CONCAT(STRING(xxxyear, F4), '-0', STRING(xxxmonth, F1)).
IF (xxxmonth >= 10) xxxLLCdate = CONCAT(STRING(xxxyear, F4), '-', STRING(xxxmonth, F2)).
EXECUTE.

* The agreed LLC age variable is in integer months,
* and it must agree with the birth and booklet year/month variables that will be available in the LLC.
NUMERIC xxxLLCage (F3.0).
COMPUTE xxxLLCage = (xxxmonth + (xxxyear * 12)) - (birthmonth + (birthyear * 12)).
EXECUTE.
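
The derivation can be illustrated with a worked example reproducing the value 306 mentioned above. The sketch below uses one invented row (not TEDS data) with hypothetical variable names prefixed 'ex', and should be run separately from any real dataset because DATA LIST replaces the active dataset. As an alternative to the two IF statements above, it builds the date string with the N2 format, which should pad single-digit months with a leading zero; this is a swapped-in technique, not the syntax used in constructing the TEDS datasets.

```spss
* Hypothetical worked example: event in August 2021, birth in February 1996.
DATA LIST FREE / exyear exmonth birthyear birthmonth.
BEGIN DATA
2021 8 1996 2
END DATA.
* Build the yyyy-mm string; the N2 format writes the month with a leading zero.
STRING exLLCdate (A7).
COMPUTE exLLCdate = CONCAT(STRING(exyear, F4), '-', STRING(exmonth, N2)).
* Age in months: (8 + 2021*12) - (2 + 1996*12) = 306, or 25 years and 6 months.
COMPUTE exLLCage = (exmonth + (exyear * 12)) - (birthmonth + (birthyear * 12)).
FORMATS exLLCage (F3.0).
LIST.
```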

School cohort

Since age 7, in many of the TEDS studies, data have been collected annually from sets of participants in their school cohort groups. In some studies, data collection was limited to only one or two school cohorts (for example, at ages 9 and 10 data were only collected from cohorts 1 and 2). Details are described in full in the main page for each TEDS study.

A cohort variable, with values 1-4, is therefore provided as a background variable where needed. This may be used to distinguish between different waves of data collection. It may also be used to check for cohort effects in analysis.

The UK school year runs from each September until August of the following year. Pupils are placed in school year groups according to their dates of birth, using the date of 1st September as the cut-off. TEDS twins were all born between 1st January 1994 and 31st December 1996 and therefore fall into 4 school year or cohort groups, labelled as cohorts 1, 2, 3 and 4:

1. Twins born between 1st January and 31st August 1994
2. Twins born between 1st September 1994 and 31st August 1995
3. Twins born between 1st September 1995 and 31st August 1996
4. Twins born between 1st September and 31st December 1996

Hence, for example, those twins grouped in cohort 1 can all be expected to have attended UK school "Year 2", in which they reached the age of 7 years, between September 2000 and August 2001; while those twins in cohort 2 should have attended Year 2 the following year, between September 2001 and August 2002.

The cohort variable is derived as shown in the SPSS syntax below, from the year (aonsby) and month (aonsbm) of each twin pair's birth:

* Derive a school cohort variable from twin birth date.
NUMERIC cohort (F1.0).
IF (aonsby = 1994 & aonsbm < 9) cohort = 1.
IF (aonsby = 1994 & aonsbm >= 9) cohort = 2.
IF (aonsby = 1995 & aonsbm < 9) cohort = 2.
IF (aonsby = 1995 & aonsbm >= 9) cohort = 3.
IF (aonsby = 1996 & aonsbm < 9) cohort = 3.
IF (aonsby = 1996 & aonsbm >= 9) cohort = 4.
EXECUTE.
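
The six IF statements above can also be expressed as a single calculation, which may make the logic easier to see: the cohort number increases by one for each birth year after 1994, and by one more for births from September onwards. The sketch below is an equivalent formulation for checking purposes only, not the syntax used in constructing the TEDS datasets. The name cohortcheck is hypothetical, and the syntax can only be run where aonsby and aonsbm are available.

```spss
* Sketch only: cohortcheck is a hypothetical variable used to cross-check cohort.
* The logical expression (aonsbm >= 9) evaluates to 1 if true and 0 if false.
COMPUTE cohortcheck = (aonsby - 1994) + 1 + (aonsbm >= 9).
VALUE LABELS cohort cohortcheck
  1 'Jan-94 to Aug-94' 2 'Sep-94 to Aug-95' 3 'Sep-95 to Aug-96' 4 'Sep-96 to Dec-96'.
CROSSTABS /TABLES = cohort BY cohortcheck.
```

If the two derivations agree, all cases should fall on the diagonal of the resulting table.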

Variables aonsby and aonsbm, being part of the twin birth date, are not retained in the TEDS datasets in order to ensure that participants are not identifiable.