The linkage sample
The sample of twins whose NHS data are linked within the LLC TRE has been established through fair processing as outlined above. The sample includes twins who meet the following criteria:
- They participated recently in TEDS, so we are confident that their contact details are correct.
- They have been sent the fair processing materials by post (with a postal reminder) and by email (with an email reminder); OR they have given informed consent when interviewed (by E-Risk).
- They have not asked to opt out of the linkage.
- They have not withdrawn from TEDS.
At the time of writing, the sample sent for linkage includes approximately 10,800 individual twins. Roughly 80% of them are paired twins and the remainder are unpaired twins.
The actual number of twins linked to NHS medical records within the LLC TRE may be smaller if some of the twins cannot be identified and linked. Twins who have opted out nationally, independently of TEDS, will also be removed during the linkage process. Once placed in the LLC TRE, all twin records are de-identified, and it is not possible for TEDS staff or researchers to determine which twins have or have not been linked successfully, nor to determine which twins have opted out nationally.
The linked twin sample will change gradually over time. More twins will be sent the fair processing materials if and when a recent contact is established; more twins in the E-Risk sub-study will be asked for consent; and more twins will decide to opt out or withdraw from TEDS. TEDS can send an updated sample for linkage every 3 months. This means that the linkage sample is likely to change at 3-month intervals, although often these changes will be small. The TEDS datasets uploaded to the LLC TRE will change less frequently, but new versions will be made when there have been significant changes to the sample or changes to the underlying TEDS datasets.
Dataset details
Single entry
As described above, the sample of twins whose NHS data are linked, and whose TEDS data will be placed in the LLC TRE, is determined by the fair processing. Twins are removed from the dataset samples if (a) they were not contacted during fair processing, in other words they are not recent participants with known contact details; (b) they have opted out of linkage following fair processing; or (c) they have withdrawn from TEDS at any time.
The TEDS datasets uploaded to the LLC TRE will include only this sample of twins, which is smaller than the sample in conventional TEDS datasets. Those twins not in the sample will have their data removed from the most recent versions of datasets placed in the LLC TRE. In many cases, only one twin from a given pair is in the sample, so only that twin's data are included.
The data for ineligible twins cannot easily be removed from conventional TEDS double-entered datasets, because this would involve removal of twin data from some rows and cotwin data from other rows. Therefore, ineligible twins are conveniently removed by the use of single-entered datasets. Each row of a TEDS single-entered dataset contains data for the twin, but not for the cotwin. The single-entered datasets uploaded to the LLC TRE therefore contain the twin variables having names ending in "1", but not the cotwin variables having names ending in "2".
Note that the linked NHS datasets within the LLC TRE will also be single-entered. If double-entered datasets are required for twin modelling analysis within the TRE, then it will be necessary for the researcher first to merge the required datasets then restructure from single-entry to double-entry. TEDS will be able to assist by providing sample scripts that will achieve this, starting from a dataset containing one row of data per twin.
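As a rough illustration of the restructuring step described above, the following Python sketch converts single-entered rows to double-entered rows. It is not one of the sample scripts provided by TEDS. It assumes that each row is a dictionary holding one twin's data, that randomfamid identifies the family and twin distinguishes the two twins in a pair, and that every variable whose name ends in "1" is twin-specific. The function name double_enter, the variable score1 and all data values are hypothetical.

```python
def double_enter(rows, family_key="randomfamid", twin_key="twin"):
    """Return double-entered copies of single-entered rows.

    Each input row is a dict holding one twin's data. Variables whose
    names end in "1" are treated as twin-specific (an assumption);
    all other variables are left unchanged.
    """
    # Group the rows by family so that each twin can find its cotwin.
    families = {}
    for row in rows:
        families.setdefault(row[family_key], []).append(row)

    result = []
    for row in rows:
        # The cotwin is the other row, if any, in the same family.
        others = [r for r in families[row[family_key]]
                  if r[twin_key] != row[twin_key]]
        cotwin = others[0] if others else None

        out = dict(row)
        for name in row:
            if name.endswith("1"):
                # The cotwin's copy of "abc1" is stored as "abc2".
                # Unpaired twins get a missing value (None).
                out[name[:-1] + "2"] = cotwin.get(name) if cotwin else None
        result.append(out)
    return result


# Hypothetical data: one complete pair and one unpaired twin.
single = [
    {"randomfamid": 101, "twin": 1, "score1": 10},
    {"randomfamid": 101, "twin": 2, "score1": 14},
    {"randomfamid": 202, "twin": 1, "score1": 7},
]
double = double_enter(single)
```

In this sketch, unpaired twins keep their own row but have missing values in all cotwin variables, so they can be kept or dropped as the analysis requires.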
All data specifically relating to an ineligible twin will be removed from each single-entered TEDS dataset. This will include all data provided by the twins themselves, data provided by any teacher of the twin in earlier studies, and twin-specific data provided by the twin's parents describing the ineligible twin.
TEDS data provided by parents of the twins, at various ages, will be included in full for twin pairs where both twins are part of the eligible sample. If only one twin is in the eligible sample, then the dataset will include twin-specific variables relating to that twin only, along with variables describing the parents, the home environment or the family as a whole.
Dataset variables
Variables in the TEDS datasets uploaded to the LLC TRE will be fully documented within this data dictionary. Nearly all such variables are identical to those already familiar to TEDS researchers and collaborators from the main TEDS datasets. A small number of new variables are described below.
As mentioned above, the TEDS datasets in the LLC TRE will be single-entered. They will contain variables with data for each included twin (variable names ending in "1") and per-family variables originating from parent questionnaires (variable names ending neither in "1" nor in "2"). However, they will not include variables describing the cotwin of each included twin (variable names ending in "2").
The datasets will additionally include some new variables that have been created or derived especially for use within LLC datasets. These are described in the following table.
Variable | Examples | Coding or values | Description |
---|---|---|---|
STUDY_ID | STUDY_ID | Hashed string values | Unique, pseudonymous twin identifier used in all TEDS datasets in the LLC TRE, including the linked NHS datasets. See the scrambled IDs page for more information. The hashing is carried out by the LLC after they have received the TEDS data; the hashing is effectively irreversible. This means that there is no key for changing the STUDY_ID value back into an identifiable ID. The STUDY_ID is pseudonymous in the sense that it has the same values in all datasets, both TEDS datasets and NHS datasets, allowing them to be linked together. |
erisk | erisk | 1=yes, 0=no | Flags whether each twin is a participant in the E-Risk study, whose sample is a subset of the TEDS study sample. This variable is included for the convenience of researchers who wish to use both E-Risk and TEDS datasets linked to NHS data in the LLC TRE. |
LLC age variables | gpbLLCage (twin age when the 7 Year parent booklet was returned); zmhLLCage1 (twin age when the 26 Year MHQ was collected) | Integer number of months | Twin age in months, for use in LLC datasets, replacing the usual TEDS age variables used elsewhere. See the background variables page for more information. Each such LLC age variable is now documented in the derived variables page for the given study. |
LLC date variables | aLLCdate (year and month when the 1st Contact booklet was returned); pcwebLLCdate1 (year and month when the 16 Year twin web tests were started) | String in format 'yyyy-mm' | Partial date, in the form of year and month, for use in LLC datasets. Such 'timestamp' variables are required by the LLC for the purpose of sequencing TEDS events (like questionnaire measures) relative to events in the NHS data (like diagnoses). See the background variables page for more information. Each such LLC date variable is now documented in the derived variables page for the given study. |
There are many LLC age and date variables, one for each of the main TEDS data collections. For derivation details, see the derived variables page of the relevant study.
In TEDS datasets uploaded to the LLC TRE, these special age variables (measured in integer months) replace the age variables more usually used in TEDS datasets used in other contexts (where age is measured in decimal years). The ages in months are designed to be compatible with the LLC date variables that provide the month and year of each data collection. The background variables page provides more details about these variables.
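The following Python sketch shows one way in which an LLC age variable and its matching LLC date variable might be used together to estimate a twin's age at the time of an NHS event. It assumes that the NHS event date has first been reduced to the same 'yyyy-mm' form, which may not match the format of the actual NHS data. The function names and all values are hypothetical.

```python
def parse_year_month(text):
    """Convert a 'yyyy-mm' string into a running count of months."""
    year, month = text.split("-")
    return int(year) * 12 + int(month) - 1


def age_at_event(llc_age_months, llc_date, event_date):
    """Estimate twin age in months at an event dated 'yyyy-mm'.

    llc_age_months: value of an LLC age variable (integer months).
    llc_date: value of the matching LLC date variable ('yyyy-mm').
    event_date: year and month of the event (assumed 'yyyy-mm').
    """
    offset = parse_year_month(event_date) - parse_year_month(llc_date)
    return llc_age_months + offset


# Hypothetical values: a twin aged 312 months at a data collection
# dated 2021-03, with events ten months later and three months earlier.
later = age_at_event(312, "2021-03", "2022-01")
earlier = age_at_event(312, "2021-03", "2020-12")
```

Because both the age and the date are recorded to the nearest month, an estimate of this kind can only be accurate to within about a month.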
Shared datasets
Access to the TEDS datasets and linked NHS data within the LLC TRE is governed by application directly to LLC (https://ukllc.ac.uk/apply), not by application to TEDS. Applications involving the use of TEDS data will be referred by LLC to TEDS for approval. Successful application will further be subject to agreement and signing of contractual documents issued by Bristol University, which hosts the LLC TRE.
When applying to LLC for access to TEDS linked data, an applicant researcher will request both NHS data and TEDS data at the level of available datasets, not at the level of measures or variables within those datasets. On approval, LLC will give the researcher access only to those datasets specified in the application and justified by the research proposal.
The available NHS datasets are not documented in this data dictionary, but will be documented by LLC.
The available TEDS datasets in the LLC TRE will be listed below. In some cases, a TEDS LLC dataset will comprise all measures from a single questionnaire or data collection, for example all measures from the twin phase 1 questionnaire in TEDS21. In other cases, a TEDS LLC dataset may comprise a logical subset of measures from a questionnaire, for example the behaviour measures from the teacher questionnaire at age 7.
The TEDS LLC datasets will be added over a period of time, as needed, hence the entire TEDS dataset will not initially be covered by the available datasets within the LLC. If the data of interest do not appear already to be present in the LLC, a researcher may make a request to TEDS to add the appropriate dataset to the LLC list.
All researchers applying to use the TEDS linked data in the LLC TRE are strongly encouraged to request the generic TEDS dataset of background variables described below. This dataset includes important background variables that will not be duplicated in the other TEDS LLC datasets.
Every TEDS LLC dataset, including the linked NHS datasets, will include the STUDY_ID variable. As mentioned above, this is a unique but de-identified twin identifier that will allow the datasets to be linked or merged together. Its values will be hashed in every dataset uploaded to the LLC TRE in such a way that it retains a unique value for every participant. The hashing is effectively irreversible, which means that values of this ID variable will not allow identification of any participant, either by researchers or by TEDS staff.
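As a minimal sketch of such a merge, the Python example below joins two datasets on STUDY_ID. It assumes that each dataset has been read into a list of dictionaries with one row per twin, and it keeps only twins present in both datasets. The function name merge_on_study_id, the short ID strings and the data values are hypothetical; real STUDY_ID values are hashed strings generated by the LLC.

```python
def merge_on_study_id(left_rows, right_rows, key="STUDY_ID"):
    """Inner-join two lists of row dicts on a shared identifier.

    Assumes at most one row per identifier in each dataset.
    """
    right_by_id = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_id.get(row[key])
        if match is not None:
            # Combine the variables from both datasets into one row.
            merged.append({**row, **match})
    return merged


# Hypothetical rows from the background dataset and the 26yr_mhq dataset.
background = [
    {"STUDY_ID": "a1f3", "erisk": 0},
    {"STUDY_ID": "9c2e", "erisk": 1},
]
mhq = [
    {"STUDY_ID": "a1f3", "zmhLLCage1": 312},
]
merged = merge_on_study_id(background, mhq)
```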
Each dataset uploaded to the LLC TRE is limited to a maximum of 1024 variables. The TEDS LLC datasets will be stripped of all direct and indirect identifiers, including conventional TEDS ID variables. Identification of individual participants in the LLC datasets will not be possible even by TEDS staff; and attempts at identification will be prohibited by the LLC agreement.
E-Risk study datasets within the LLC, for the E-Risk subsample of TEDS twins, will be documented separately by the E-Risk study. They are not documented in this data dictionary.
The background dataset
The generic TEDS dataset of background variables should be requested by all researchers applying for access to TEDS data in the LLC. Most of the variables in this dataset are those described on the background variables page:
- randomfamid and twin. These will enable researchers to identify paired twins with a common family identifier, and to double-enter the datasets if needed. Note that randomfamid is randomly and irreversibly anonymised and cannot be converted to an identifiable family ID. Note also that it will be freshly randomised in successive versions of the dataset.
- random, for filtering one twin per pair.
- The twin sex and zygosity variables.
- The standard exclusion variables.
- The standard genotypic covariates, for use with any polygenic scores included in the customised dataset.
The background dataset will include the following additional variables from the 1st Contact dataset (parent-reported) and from the 26 Year MHQ dataset (twin self-reported):
- Ethnic origin: aethnic and zmhethnic1.
- Parent SES: ases and its components amohqual, afahqual, amosoc, afasoc, amagechl.
- Twin SES: zmhses1 and its components zmhhqual1, zmhempinc1, zmhecvul1.
- Language spoken at home, at time of 1st contact: alang.
The background dataset will also include the following special LLC variables, as described above on this page:
- STUDY_ID
- erisk
- aLLCdate, the month and year when 1st Contact data were collected.
List of LLC TEDS datasets
The list below will be updated as more TEDS datasets are uploaded to the LLC TRE.
The dataset name forms part of a longer file name as required by LLC.
The full file name in every case will take this form:
TEDS_[dataset_name]_v[version_number]_[yyyymmdd].sav.
For example, TEDS_background_v001_20240507.sav. They are uploaded as
SPSS data files, complete with variable labels and value labels where needed,
hence the .sav file suffix. The date [yyyymmdd] at the end of the file name
shows when each dataset version was made.
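The file name pattern above can be split into its parts with a regular expression, as in the Python sketch below. This is an illustration only: the function name parse_file_name is hypothetical, and the expression assumes that the version number consists only of digits and that the date is always eight digits.

```python
import re
from datetime import datetime

# TEDS_[dataset_name]_v[version_number]_[yyyymmdd].sav
FILE_NAME = re.compile(
    r"^TEDS_(?P<dataset>.+)_v(?P<version>\d+)_(?P<date>\d{8})\.sav$"
)


def parse_file_name(file_name):
    """Split an LLC TEDS file name into dataset name, version and date."""
    match = FILE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    return {
        "dataset": match["dataset"],
        "version": int(match["version"]),
        # strptime also rejects impossible dates such as month 13.
        "date": datetime.strptime(match["date"], "%Y%m%d").date(),
    }


info = parse_file_name("TEDS_background_v001_20240507.sav")
```

Here info holds the dataset name background, version 1 and the date 7 May 2024. Dataset names that themselves contain underscores, such as 21yr_twin_phase1, are kept whole because the version and date are matched at the end of the name.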
LLC dataset name | TEDS dataset source | Measures included | Reference links in this data dictionary |
---|---|---|---|
background | Variables from admin sources, 1st Contact and TEDS26. Contains widely-used background, exclusion and demographic variables. Should be used with all other LLC datasets. | SES and ethnic origin (1st Contact and TEDS26), twin sex, zygosity, exclusions, genotypic covariates, twin pair variables. | See the background dataset above. See also the background variables page. |
26yr_mhq | The twin TEDS26 questionnaire, also called the mental health questionnaire (MHQ). | All measures from this questionnaire. | 26 Year study |
21yr_twin_phase1 | The twin TEDS21 phase 1 questionnaire. | All measures from this questionnaire. | 21 Year study |
21yr_twin_phase2 | The twin TEDS21 phase 2 questionnaire. | All measures from this questionnaire. | 21 Year study |