TEDS Data Dictionary

The Zygosity Algorithm

The TEDS zygosity questionnaire was included in the 1st Contact, 3 Year and 4 Year parent booklets. The questionnaire, and the algorithm to derive zygosity from the results, are identical in the three booklet versions. Tom Price devised the algorithm and implemented it using SAS syntax for the early versions of the datasets. Current dataset versions are created in SPSS, so the algorithm has been translated into SPSS syntax (shown below).

The most reliable estimates for zygosity can be made from twin sexes (DZ if opposite sex) and from DNA zygosity tests (which have only been done for a minority of TEDS families). For same-sex twin pairs where DNA test results are not available, the best estimate of zygosity is the algorithm result from the booklets. Where the three booklets disagree, priority is given to the most recent conclusive result: for example, the 4 Year result is given precedence over the 3 Year result, provided that the 4 Year result was conclusive (MZ or DZ), not indeterminate.

Variables used in the Algorithm

The following variables from the zygosity questionnaire are all coded in such a way that a higher score indicates greater dissimilarity between the twins. In some cases, the raw items have been recoded to achieve this pattern of coding.

The table shows variable names in the 1st Contact dataset, but the variables are the same in the 3 Year and 4 Year datasets, except for a difference in naming (variable name prefixes are 'c' for 3 Year and 'd' for 4 Year instead of 'a' for 1st Contact).

No. | Question | Variable name (1st Contact) | Coding
1 | Has a health professional ever told you they are MZ or DZ? | azyprof | 1 = MZ, 2 = DZ
2 | Do you think they are MZ or DZ? | azyyou | 1 = MZ, 2 = DZ
3 | Differences in shade of hair | azyhairs | 1 = none, 2 = slight, 3 = clear difference
4 | Differences in texture of hair | azyhairt | 1 = none, 2 = slight, 3 = clear difference
5 | Differences in eye colour | azyeyes | 1 = none, 2 = slight, 3 = clear difference
6 | Differences in ear-lobe shape | azyears | 1 = none, 2 = slight, 3 = clear difference
7 | Did twins' teeth come through at the same time | azyteet2 | 1 = matching teeth on same or different sides within a few days, 2 = different teeth came through within a few days of each other, or teeth did not come through within days of each other
8 | Likeness between twins as they became older | azyold | 1 = greater, 2 = same, 3 = less
9 | Can you tell twins apart from a new photo? | azyphot2 | 1 = confuse them, 2 = yes, but hard, 3 = yes easily
10 | Blood group difference (if known) | azytyp | 0 = no, 1 = yes
11 | Blood rhesus factor difference (if known) | azyfac | 0 = no, 1 = yes
12 | Difficulty telling them apart (other parent) | azymistp | 1 = often, 2 = sometimes, 3 = rarely/never
13 | Difficulty telling them apart (other siblings) | azymists | 1 = often, 2 = sometimes, 3 = rarely/never
14 | Difficulty telling them apart (other relatives) | azymistr | 1 = often, 2 = sometimes, 3 = rarely/never
15 | Difficulty telling them apart (day carer/baby-sitter) | azymistb | 1 = often, 2 = sometimes, 3 = rarely/never
16 | Difficulty telling them apart (parents' close friends) | azymistf | 1 = often, 2 = sometimes, 3 = rarely/never
17 | Difficulty telling them apart (parents' casual friends) | azymistc | 1 = often, 2 = sometimes, 3 = rarely/never
18 | Difficulty telling them apart (people meeting for first time) | azymistm | 1 = often, 2 = sometimes, 3 = rarely/never
19 | Are twins mistaken for each other when together | azytoget | 1 = often, 2 = sometimes, 3 = almost never, 4 = never
20 | Likeness between the twins | azypeas | 1 = alike as two peas in a pod, 2 = alike as other sibs, 3 = not alike at all
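
The naming convention across the three booklets can be illustrated with a small Python helper (Python is used here only for illustration; the TEDS datasets themselves are built in SPSS). The names WAVE_PREFIX and wave_variable are hypothetical, introduced for this sketch, and the helper assumes that every 1st Contact variable name starts with 'a'.

```python
# Prefixes for each booklet, as described in the text above.
WAVE_PREFIX = {"1st Contact": "a", "3 Year": "c", "4 Year": "d"}


def wave_variable(first_contact_name: str, wave: str) -> str:
    """Translate a 1st Contact variable name to the equivalent name in another wave."""
    if not first_contact_name.startswith("a"):
        raise ValueError("expected a 1st Contact name starting with 'a'")
    # Swap the leading 'a' for the prefix of the requested wave.
    return WAVE_PREFIX[wave] + first_contact_name[1:]


print(wave_variable("azyeyes", "3 Year"))
print(wave_variable("azypeas", "4 Year"))
```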

Description of the Algorithm

The 20 variables listed above are summed to make a raw difference score, with higher values indicating more differences between the twins. To take account of missing values in some variables, a theoretical maximum score is computed as the highest possible score from the non-missing items for each twin (this is 54 if all items are non-missing). The raw difference score is then divided by the theoretical maximum score, re-scaling it to a decimal value between 0 and 1. This gives a 'zygosity index' score, held in variables atempzyg, ctempzyg, dtempzyg at 1st Contact, 3 Year and 4 Year respectively.

To ensure that the zygosity index score is reliable, at least half of the data must be non-missing: the theoretical maximum must be at least 27, otherwise no index score is computed. The zygosity index score has a bimodal distribution, with the two modes representing MZ and DZ twin pairs (low and high scores respectively). The algorithm then classifies index scores as follows:

  • 0.64 or less: MZ
  • Between 0.64 and 0.70 (exclusive): indeterminate (mid-range)
  • 0.70 or more: DZ
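
As an illustration, the index calculation and the thresholds can be sketched in Python (not the language used for the TEDS datasets, which are built in SPSS). The names MAX_SCORE, zygosity_index and classify_index are hypothetical, and using None for a missing response is an assumption of this sketch. The maximum scores are read from the coding table above, and the boundary handling (0.64 or less, 0.70 or more, code 5 when no index is available) follows the SPSS syntax listed later on this page. Responses are assumed to lie within their valid codes.

```python
from typing import Mapping, Optional

# Highest possible code for each item, taken from the coding table above.
MAX_SCORE = {
    "azyprof": 2, "azyyou": 2, "azyteet2": 2,
    "azytyp": 1, "azyfac": 1,
    "azyhairs": 3, "azyhairt": 3, "azyeyes": 3, "azyears": 3,
    "azyold": 3, "azyphot2": 3, "azymistp": 3, "azymists": 3,
    "azymistr": 3, "azymistb": 3, "azymistf": 3, "azymistc": 3,
    "azymistm": 3, "azypeas": 3,
    "azytoget": 4,
}


def zygosity_index(responses: Mapping[str, Optional[int]]) -> Optional[float]:
    """Raw difference sum divided by the theoretical maximum, or None if too sparse."""
    answered = {k: v for k, v in responses.items()
                if k in MAX_SCORE and v is not None}
    max_possible = sum(MAX_SCORE[k] for k in answered)
    if max_possible < 27:  # less than half of the possible 54 points
        return None
    return sum(answered.values()) / max_possible


def classify_index(index: Optional[float]) -> int:
    """Return 1 (MZ), 2 (DZ) or 5 (indeterminate) from the index alone."""
    if index is None:
        return 5
    if index <= 0.64:
        return 1
    if index >= 0.70:
        return 2
    return 5


# Only the 14 three-point items answered, all with the middle code 2.
partial = {name: 2 for name, top in MAX_SCORE.items() if top == 3}
print(zygosity_index(partial), classify_index(zygosity_index(partial)))
```

In this example the pair scores 28 out of a possible 42, giving an index of about 0.67, which falls in the mid-range band.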

This result from the index score is then over-ruled if any of the following item responses apply:

  1. DZ if they have clear eye colour differences (azyeyes = 3)
  2. DZ if they have clear hair shade differences (azyhairs = 3)
  3. DZ if they have clear hair texture differences (azyhairt = 3)
  4. DZ if they are said to be not alike at all (azypeas = 3)
  5. MZ if they are said to be as alike as two peas in a pod (azypeas = 1)

If there is a conflict between the last rule (alike as two peas in a pod) and any one of the first three rules (clear differences in eye colour, hair shade or hair texture), then the zygosity is classified as indeterminate (inconsistent).
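
The order in which these over-ruling rules apply can be sketched in Python as follows. This is an illustration only: the function name apply_overrides and its argument names are hypothetical, None stands for a missing response, and the result codes are those used on this page (1 = MZ, 2 = DZ, 5 = mid-range, 99 = inconsistent).

```python
from typing import Optional

MZ, DZ, MID_RANGE, INCONSISTENT = 1, 2, 5, 99


def apply_overrides(index_result: int,
                    hairs: Optional[int],
                    hairt: Optional[int],
                    eyes: Optional[int],
                    peas: Optional[int]) -> int:
    """Apply the item-level rules on top of the result from the index score."""
    result = index_result
    # A clear difference (code 3) in hair shade, hair texture or eye colour.
    clear_difference = 3 in (hairs, hairt, eyes)
    if clear_difference or peas == 3:
        result = DZ
    if peas == 1:
        result = MZ
    # "Two peas in a pod" combined with a clear physical difference is contradictory.
    if peas == 1 and clear_difference:
        result = INCONSISTENT
    return result


print(apply_overrides(MZ, hairs=1, hairt=1, eyes=3, peas=2))         # clear eye difference
print(apply_overrides(MID_RANGE, hairs=3, hairt=1, eyes=1, peas=1))  # conflicting answers
```

The first call returns 2 (DZ) because of the clear eye colour difference, and the second returns 99 because the responses contradict each other.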

The variables containing zygosity estimates based solely on the algorithm are aalg2zy, calg2zy, dalg2zy at 1st Contact, 3 Year and 4 Year respectively. A second set of variables also takes twin sexes into account: opposite-sex pairs are classified as DZ regardless of the algorithm result, and pairs whose sexes are unknown are classified as indeterminate (5). These variables are aalgzyg, calgzyg, dalgzyg at 1st Contact, 3 Year and 4 Year respectively. The coding of all these zygosity variables is:

  • 1=MZ
  • 2=DZ
  • 5=indeterminate (mid-range zygosity index scores)
  • 99=indeterminate (inconsistent item responses)
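
The relationship between the two sets of variables can be sketched in Python as below. The function name with_twin_sexes is hypothetical, None stands for an unknown sex, and the sketch assumes that twin sexes are coded 0 and 1, as the tests in the SPSS syntax on this page suggest.

```python
from typing import Optional


def with_twin_sexes(alg2zy: int, sex1: Optional[int], sex2: Optional[int]) -> int:
    """Derive the sex-aware estimate (e.g. aalgzyg) from the algorithm-only one (e.g. aalg2zy)."""
    if sex1 is None or sex2 is None:
        return 5          # sexes unknown: indeterminate
    if abs(sex1 - sex2) == 1:
        return 2          # opposite-sex pair: DZ regardless of the algorithm
    return alg2zy         # same-sex pair: keep the algorithm result


print(with_twin_sexes(1, 0, 1))   # opposite sexes
print(with_twin_sexes(99, 1, 1))  # same sex, inconsistent items
```

The first call returns 2 and the second returns 99, since a same-sex pair keeps whatever the algorithm produced.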

1st Contact algorithm syntax

This syntax is used in the construction of the 1st Contact dataset, in SPSS. The syntax used in the 3 Year and 4 Year datasets is identical, except that the variable names are changed (variable name prefixes are changed from 'a' at 1st Contact to 'c' at 3 Year or 'd' at 4 Year).

The end results of this syntax are two estimates of zygosity: aalg2zy without reference to twin sexes, and aalgzyg with reference to known twin sexes.

* Compute a difference sum, from ordinal variables with higher values = more different.
COMPUTE sumzyg = SUM(azyprof, azyyou, azyhairs, azyhairt, azyeyes, azyears,
 azyteet2, azyold, azyphot2, azyfac, azytyp, azymistp, azymists, azymistr,
 azymistb, azymistf, azymistc, azymistm, azytoget, azypeas).
* Determine maximum possible score, depending on number of non-missing.
* responses in the above variables (total is 54 if none missing).
COUNT zyg1 = azyfac azytyp (0 thru 1).
COUNT zyg2 = azyprof azyyou azyteet2 (1 thru 2).
COUNT zyg3 = azyhairs azyhairt azyeyes azyears azyold azyphot2 azymistp
 azymists azymistr azymistb azymistf azymistc azymistm azypeas (1 thru 3).
COUNT zyg4 = azytoget (1 thru 4).
COMPUTE zygtot = SUM(zyg1, (2 * zyg2), (3* zyg3), (4 * zyg4)).
* Can now re-scale the difference score to range 0-1.
* requiring at least half the data to be non-missing.
* (total possible score must be 27 or higher).
IF (zygtot >= 27) atempzyg = sumzyg / zygtot.

* Now zygosity from algorithm can be derived.
* Start with default value of 5 (indeterminate).
COMPUTE aalgzyg = 5.
* Now use the difference score: 0.64 or less means MZ.
* 0.70 or more means DZ.
IF (atempzyg <= 0.64) aalgzyg = 1.
IF (atempzyg >= 0.70) aalgzyg = 2.
* Now over-rule the score and conclude DZ if there are clear differences.
* in eye colour, hair shade or hair texture, or if they look very different. 
IF (azyhairs = 3 | azyhairt = 3 | azyeyes = 3 | azypeas = 3) aalgzyg = 2.
* Also over-rule score if alike as two peas in a pod (conclude MZ).
IF (azypeas = 1) aalgzyg = 1.
* But if the latter clashes with clear differences in hair/eyes.
* then the result is inconsistent (value 99).
IF ((azypeas = 1) & (azyhairs = 3 | azyhairt = 3 | azyeyes = 3)) aalgzyg = 99.

* Copy the result into a second variable representing the derived.
* zygosity without reference to information about twin sexes.
* This will be used for admin purposes, to track changes in estimated.
* zygosity for pairs where the twin sexes are updated.
COMPUTE aalg2zy = aalgzyg.

* Check whether twins have different sexes.
COMPUTE sexdif = ABS(sex1 - sex2).

* Finally, for aalgzyg but not aalg2zy, over-rule all other data.
* if twins have opposite sexes (DZ) or sexes are unknown.
IF (sexdif = 1) aalgzyg = 2.
IF (SYSMIS(sexdif)) aalgzyg = 5.

Making the best estimate of zygosity

The zygosity variable used in the TEDS datasets is zygos, which is the best available estimate of zygosity derived from all the available evidence:

  • Twin sexes, as recorded in the admin database
  • DNA zygosity test results, also recorded in the admin database
  • Zygosity algorithm results, as described above, from the 1st Contact, 3 Year and 4 Year booklets.

The logical rules used to make the best estimate of zygosity are as follows, in order of precedence:

  1. If the twins have opposite sexes, they are DZ regardless of other results.
  2. For same-sex twin pairs:
    1. If a conclusive DNA test result is available, this is the best estimate of zygosity
    2. If no DNA test result is available:
      1. If the 4 Year booklet zygosity algorithm gave a conclusive result, then this is the best estimate of zygosity.
      2. If the 4 Year algorithm result is unavailable, or was inconclusive, use the 3 Year booklet zygosity algorithm result.
      3. If the 4 Year and 3 Year algorithm results are both unavailable or inconclusive, use the 1st Contact booklet zygosity algorithm result.

Hence, if the 1st Contact, 3 Year and 4 Year booklet algorithm results are contradictory, the most recent conclusive result is taken to be the most reliable. This is based on the assumption that differences between twins in DZ pairs may become easier for parents to distinguish with increasing age. If a DNA zygosity test result is available, then this is used as the best estimate for a same-sex pair even if it contradicts booklet zygosity algorithm results. For opposite-sex twin pairs, an assumption of DZ zygosity is made even in rare cases where this may be contradicted by other results.
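
The order of precedence above can be sketched in Python as follows. This is an illustration under stated assumptions, not the TEDS implementation: the function name best_zygosity and its argument names are hypothetical, None stands for a missing value, sexes are assumed to be coded 0 and 1, and a conclusive result means a code of 1 (MZ) or 2 (DZ). The rules above do not say what happens when a twin's sex is unknown; returning None in that case is an assumption of this sketch.

```python
from typing import Optional

CONCLUSIVE = (1, 2)  # 1 = MZ, 2 = DZ


def best_zygosity(sex1: Optional[int], sex2: Optional[int],
                  dna: Optional[int],
                  alg_4yr: Optional[int],
                  alg_3yr: Optional[int],
                  alg_1st: Optional[int]) -> Optional[int]:
    """Return 1 (MZ), 2 (DZ) or None when no conclusive estimate is available."""
    if sex1 is None or sex2 is None:
        return None  # assumption: unknown sexes give no estimate
    if sex1 != sex2:
        return 2     # opposite-sex pairs are DZ regardless of other results
    if dna in CONCLUSIVE:
        return dna   # a DNA test outranks all booklet results
    # Booklet results, most recent first; skip missing or indeterminate values.
    for result in (alg_4yr, alg_3yr, alg_1st):
        if result in CONCLUSIVE:
            return result
    return None


# Same-sex pair, no DNA test, 4 Year result indeterminate (5), 3 Year says DZ.
print(best_zygosity(0, 0, dna=None, alg_4yr=5, alg_3yr=2, alg_1st=1))
```

Here the 3 Year result (DZ) is used, because the 4 Year result was inconclusive and the 3 Year result takes priority over the 1st Contact result.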

Syntax for best estimate of zygosity

The SPSS syntax below is used in the construction of the TEDS datasets. It shows how a best estimate of zygosity (variable zygos) is made using all available sources of data: the zygosity algorithm results from 1st Contact, 3 Year and 4 Year booklets; DNA zygosity test results, if available, from admin data; and twin sexes also from admin data.

For more information about zygosity variables in the dataset, see the background variables page.

* Start with DNA zygosity (already coded 1=MZ 2=DZ).
COMPUTE zygos = DNAzygos.

* Ensure zygosity is DZ if twins have opposite sexes.
* which will often be the case where no DNA test was done.
* (if there is a discrepancy with the DNA test, opposite-sexes takes higher priority).
DO IF ((sex1 = 1 & sex2 = 0) | (sex1 = 0 & sex2 = 1)).
  RECODE zygos (ELSE=2).
END IF.

* If neither of the above applies (no DNA result, same-sexes).
* and zygos still has a missing value, then use the zygosity algorithm results.
* Highest-priority of these is the 4 year booklet result, dalg2zy.
* but only use it if the result was conclusive (coded 1=MZ/2=DZ not 5 or 99).
IF (SYSMIS(zygos) & ANY(dalg2zy, 1, 2)) zygos = dalg2zy.
* Next in priority order is the 3 year booklet result, calg2zy, using the same logic.
IF (SYSMIS(zygos) & ANY(calg2zy, 1, 2)) zygos = calg2zy.
* Last in priority order is the 1st contact booklet result, aalg2zy, using the same logic.
IF (SYSMIS(zygos) & ANY(aalg2zy, 1, 2)) zygos = aalg2zy.

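* Treat zygosity as unreliable if either twin's sex is unknown.
* (which is very unlikely if the results above are non-missing).
* The action inside this DO IF is not shown on this page.
* Setting zygos to system-missing is an assumption.
DO IF (SYSMIS(sex1) | SYSMIS(sex2)).
  RECODE zygos (ELSE=SYSMIS).
END IF.
* This is now the best estimate of zygosity.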