"Data entry" is used here as a general term describing the conversion of the original subject responses into usable electronic form. This can involve any or all of the following stages:
- Making an original record of the responses (and possibly transcribing)
- Coding the responses
- Electronic recording of responses in some form of database
The first stage, making the original record, often involves a booklet or questionnaire completed by the respondents themselves; we therefore have little control over this once it reaches TEDS. The second and third stages, coding and entering, are often carried out simultaneously.
The various types of responses in TEDS, and the data entry processes used, are summarised in this table:
|Response Type||Original Record of Responses||Coding||Data Entry Method||File storage||TEDS Examples|
|Event logs and miscellaneous contacts with families||In the TEDS admin database||Coded automatically by database programs||Manually keyed in to database||SQL Server database||Return dates for booklets and questionnaires; updates to family background data|
|Web tests and web/app questionnaires||In relational databases on the web server||Coding for categorical responses is built in to the web programs.||Manually keyed in by respondents themselves||Output from server databases into csv text files||Twin web tests at 10; web version of teacher questionnaire at 12|
|Paper booklets/ questionnaires completed by respondents||Paper booklets/ questionnaires||Categorical responses were numerically coded by data entry staff; in some cases they were automatically coded by database programs||Manually keyed into computer||Access databases||12 year twin booklets|
|Paper booklets/ questionnaires completed by respondents||Paper booklets/ questionnaires||Text responses entered verbatim, and subsequently coded by trained TEDS staff||Manually keyed into computer||Access databases, Excel spreadsheets, plain text files||7 year parent occupations; 9 year twin stories|
|Paper booklets/ questionnaires completed by respondents||Paper booklets/ questionnaires||Coding (for categorical responses) is built in to the scan setup||Optical scanning||Plain text files with comma-delimited variables||9 year booklets|
|Responses in telephone tests/ questionnaires||Responses were entered electronically during the interview itself by CATI (computer aided telephone interviews)||Some responses were categorized and coded directly by NOP staff; other responses were recorded verbatim and subsequently coded by TEDS staff||Manually keyed into computer||Access databases||Telephone tests administered by NOP in cohort 1 of the 7 year study|
|Responses in telephone tests/ questionnaires||Categorized and recorded in prepared score sheets; some responses recorded verbatim||Coded by TEDS interviewers after the tests||Manually keyed into computer||Access databases||7 year telephone tests after cohort 1; 12 year TOWRE tests|
|Responses in telephone tests/ questionnaires||Audio recording of interview||Transcribed and coded by trained coders at York University||Manually keyed into computer||(not known at time of writing)||14 year language tests|
|Responses in face-to-face tests||Categorized and recorded in prepared booklets||Categories were numerically coded by data entry staff||Manually keyed into computer||Excel spreadsheets or Access databases||Parent-administered PARCA tests at 2, 3 and 4|
|Responses in face-to-face tests||Verbatim responses recorded in prepared score sheets||Coded by TEDS testers, after testing||Manually keyed into computer||Access databases||Some of the in-home visit tests|
Factors Affecting Methods Used
Factors that have affected the choices of data entry methods used in TEDS include the following: the nature of the data; the volume of data; availability of TEDS personnel; availability of electronic and other facilities in TEDS; cost; accuracy. The advantages and disadvantages of the main methods are discussed below.
- Web/app data collection.
The web has been increasingly used since the 10 year study, for twin web tests and for selected questionnaires. From TEDS21, data were additionally collected using a phone app. This type of electronic data collection has several advantages: subjects enter the data themselves, reducing data entry costs; accuracy is high, because web programs can reject or prevent invalid inputs; postage costs are reduced; complex test rules, including branching and discontinuation, can be implemented; and multimedia elements such as graphics, sound and animation can be incorporated. Disadvantages include the high cost of setting up web tests (less so for questionnaires), and the fact that some respondents have been reluctant to use the web or have not had easy access to it. Over time, web data collection has benefitted from the growing popularity of home computing and broadband connections. Web data are not always perfectly clean; web data cleaning is addressed on a separate page.
- Optical scanning.
This has been widely used for paper booklets and questionnaires since the 7 year study. Scanning equipment is not available in TEDS, so booklets have been sent to a commercial company (usually Group Sigma) for scanning. This has several advantages: it is ideally suited for categorical responses, if recorded using tick boxes; accuracy is high, as long as the responses are recorded with due care; large volumes of data can be entered quickly. Disadvantages include relatively high cost (overheads are especially high for smaller volumes of data); not well suited for non-categorical data like free text or numeric responses; less accurate if responses are recorded carelessly, especially by younger children.
- Manual keying, in-house.
This has been used throughout all TEDS studies, for updating family background data and logging of things like booklet returns; no other suitable method has been found for these tasks. Manual keying has also been used for entering paper data in cases such as the following: where the volumes of data were low enough to be manageable by TEDS staff; where the responses were recorded carelessly (hence less suitable for scanning); where numeric or text data required interpretation or special coding; or where the data were returned too late to be entered by other means.
- Manual keying, by NOP.
In the early booklet studies, up to age 4, and in the 1st cohort of the 7 year study, NOP Numbers (a commercial company) were employed for the data entry. In the 7 year study, NOP employees carried out the parent and twin telephone interviews, and entered the data directly in electronic form. For the booklet studies, the paper booklets were delivered to NOP for manual keying. The main advantage of this arrangement was that NOP had sufficient personnel to enter the data reasonably quickly; also, the responses in the booklets would not have been suitable for scanning. The disadvantages, which became more apparent over time, were that TEDS had little control over the quality of data entry, and that data errors were quite common; and in the 7 year study, there were concerns over the quality of telephone calls made to some families.
Methods Used in TEDS Studies
|Study||Scanning||Manual data entry||Web/app self-entry|
|7 Year, after cohort 1||
|7 Year, cohort 1||-||
(1996 cohort and most of 1995 cohort)
(1994 cohort and some of 1995 cohort)
Further details of data collection and data entry are described on the main study pages. The original raw data files produced from the sources listed above have (in most cases) been retained, unaltered. With the exception of the raw data from the web tests, the cleaned data for each study, from all sources, have been aggregated into a single Access database file. See the data files page for more details.
The aspect of quality control discussed here is the accuracy of the entered data. The original raw data are assumed to be recorded on paper. (Web data do not require data entry, and are not subject to the same quality control issues once the web programs have been tested.)
Prior to data entry, accuracy in recording responses on paper can be encouraged by the use of clear instructions, unambiguous questions, clear formatting and layout, suitable use of tick boxes, and so on. Some errors will inevitably occur when the original responses are recorded on paper, but these are generally beyond control once the data entry stage has been reached.
The choice of data entry method is an important factor affecting accuracy; this has already been discussed above (see Methods Used).
Depending on the method chosen, the way the data entry is set up can affect the accuracy of the entered data. In optical scanning, the initial set-up involves matching each tick box to a numbered position in the electronic file, and assigning a numeric code to each recorded response; a mistake here could result in systematic errors throughout the data, but careful initial checking (with a sample of test data) should ensure that this is 100% accurate. Another aspect of set-up for scanning is the level of reflected light that is registered as a tick/cross in each box; if this is set too high, then fainter ticks/crosses can be missed; if it is set too low, then inadvertent marks on the paper, or even specks of dirt, could wrongly be scanned as ticks. The optimum level is usually left to the expertise of the scanning staff; however, it can be checked again using a suitable sample of test data. Another aspect of scan set-up is the procedure used for "double ticks": for most items, only one response is sought so only one box should be ticked; if more than one box is ticked, the scan is paused and the item is brought up on the scan operator's screen; the operator can then decide which tick is genuine, or what to do if there are two apparently genuine ticks (our usual policy here is to treat the item response as missing).
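The effect of the scan sensitivity setting and the "double tick" policy can be illustrated with a minimal Python sketch. It is not the scanning company's software: the `resolve_item` function, the per-box "darkness" scale (0.0 for blank paper, 1.0 for a solid mark) and the threshold values are all assumptions made for illustration, and the operator's on-screen judgement is simplified to "treat as missing".

```python
from typing import Optional, Sequence


def resolve_item(darkness: Sequence[float], threshold: float) -> Optional[int]:
    """Return the 1-based code of the ticked box, or None if the item is missing.

    darkness  -- hypothetical mark darkness per box, 0.0 (blank) to 1.0 (solid)
    threshold -- minimum darkness that is counted as a tick
    """
    ticked = [code for code, level in enumerate(darkness, start=1)
              if level >= threshold]
    if len(ticked) == 1:
        return ticked[0]
    # No tick at all, or more than one tick: treated as missing in this sketch.
    return None


# A faint but genuine tick in box 2, and a speck of dirt in box 4.
boxes = [0.05, 0.40, 0.02, 0.15]

print(resolve_item(boxes, threshold=0.30))  # 2: faint tick detected, speck ignored
print(resolve_item(boxes, threshold=0.60))  # None: faint tick is missed
print(resolve_item(boxes, threshold=0.10))  # None: speck counted, giving a double tick
```

Running a sample of test booklets through this kind of rule at several sensitivity settings is one way to see which setting reproduces the responses on paper.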
The set-up for manual keying of data can also help to control errors in the data. For example, it may be possible to program the data-entry software to validate each item of data as it is entered - this is possible in Microsoft Access, for example. Invalid data values are then immediately flagged or rejected, and must be keyed in again. The layout and formatting of the data entry screen are also important, for example in helping the operator to keep track of their position in the data so that the correct response is recorded for the correct item. In TEDS, Microsoft Access forms have been used for manual keying of data, as these forms can lay out the items on screen for one page at a time, and the layout of the form on screen can be designed to follow the layout of the page on paper. Another helpful technique is to require each ID to be keyed in twice, and to require respondent names to be entered as well as IDs, in order to ensure that each case is correctly identified in the data. Access can be programmed to check each ID against a valid list of IDs for families or twins, and can also be programmed to look up names to check against the name recorded on paper.
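The checks described above (double keying of IDs, lookup against a list of valid IDs, and validation of each item value) were implemented in Access forms. As a rough illustration only, the same logic might look like this in Python; the ID values, item names and permitted ranges are hypothetical.

```python
VALID_IDS = {"10011", "10012", "10021"}      # hypothetical twin IDs
ITEM_RANGES = {"q1": (0, 4), "q2": (1, 5)}   # hypothetical permitted codes per item


def validate_entry(id_first, id_second, items):
    """Return a list of problems; an empty list means the record is accepted.

    id_first, id_second -- the ID as keyed the first and second time
    items               -- dict of item name -> keyed value (None = left blank)
    """
    problems = []
    if id_first != id_second:
        problems.append("ID mismatch between first and second keying")
    elif id_first not in VALID_IDS:
        problems.append(f"unknown ID {id_first}")

    for name, value in items.items():
        if name not in ITEM_RANGES:
            problems.append(f"{name}: not an item in this booklet")
            continue
        low, high = ITEM_RANGES[name]
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}: value {value} outside {low}-{high}")
    return problems


print(validate_entry("10011", "10011", {"q1": 3, "q2": 5}))  # accepted: []
print(validate_entry("10011", "10012", {"q1": 3}))           # ID keyed inconsistently
print(validate_entry("10021", "10021", {"q1": 9}))           # invalid value is flagged
```

A check of this kind rejects invalid values at the moment of keying, but it cannot tell whether a valid value is the one actually written on the page.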
The main method used for evaluating accuracy in data entry is to carry out checks on random samples of data. This checking is usually done using a visual item-by-item comparison of the original data (on paper) with the entered data (in an electronic file). The number and types of error can then be recorded, giving a numerical measurement of the accuracy level (see Estimating Accuracy below). For optically scanned data, as mentioned above, it is particularly important to check an initial sample of test data in order to verify that the set-up is correct. More generally, for both optical scanning and manual keying, small random samples should ideally be checked throughout the data entry process. With manual data entry, it can be helpful to check a random sample for each operator, to ensure consistency.
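Drawing a separate random sample for each operator can be sketched as follows. This is an illustrative Python example rather than a description of the procedure actually used in TEDS; the booklet IDs, operator names and sample size are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_operator(booklets, per_operator, seed=None):
    """Pick a random sample of booklets for checking, separately per operator.

    booklets     -- iterable of (booklet_id, operator) pairs
    per_operator -- number of booklets to check for each operator
    seed         -- optional seed so that the same sample can be drawn again
    """
    rng = random.Random(seed)
    by_operator = defaultdict(list)
    for booklet_id, operator in booklets:
        by_operator[operator].append(booklet_id)

    # If an operator entered fewer booklets than requested, check all of them.
    return {
        operator: sorted(rng.sample(ids, min(per_operator, len(ids))))
        for operator, ids in by_operator.items()
    }


entered = [("B001", "op_a"), ("B002", "op_a"), ("B003", "op_a"),
           ("B004", "op_b"), ("B005", "op_b"), ("B006", "op_c")]
print(sample_per_operator(entered, per_operator=2, seed=1))
```

Sampling within each operator guarantees that every operator's work is represented, which a single random sample across all booklets would not.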
Usually, this type of checking has established that the accuracy level is acceptable and that no changes are needed. However, where unacceptable error levels are found, it may be necessary to take one of the following courses of action:
- Start again with a different method of data entry, e.g. change from manual keying to optical scanning. This was done with the parent booklets at age 12.
- Manually check and correct the entire sample of data - this is only feasible for relatively small samples.
- Identify the sub-sample containing most of the problems; re-enter, or check and correct this sample. For instance, some of the 9 year twin booklets were completed in a very careless or untidy way; such booklets were picked out by visual inspection for detailed checking. A problematic sub-sample might also be related to identifiable data entry operators, or to a batch of misprinted booklets, and so on.
- Identify particular items in the booklet that are error-prone; re-enter, or check and correct these items in all the booklets. For instance, the twin heights and weights in the 7 year parent booklet were difficult to scan, and a great deal of checking and correcting was needed for these items. In subsequent booklets, heights and weights were entered manually instead of scanning.
After data entry, the entire file of electronic data can be checked for particular types of error, such as invalid item values, invalid IDs, and booklets with large numbers of missing items. This can be done as a bulk operation, for example using scripts in SPSS or by using queries in Access. This is similar to data cleaning, but if done immediately after data entry it can also help to enable further checks and corrections to be made. Unfortunately, however, this approach does not detect all types of error; for example, it will not detect typographical errors made during manual keying, unless the error results in an invalid item value being entered.
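In TEDS these bulk checks were run with SPSS scripts or Access queries. Purely as an illustration, the three checks named above could be sketched in Python as follows; the record layout, item names, permitted ranges and the missing-item cut-off are assumptions.

```python
def bulk_check(records, valid_ids, item_ranges, max_missing):
    """Scan an entire data file for the error types that can be found in bulk.

    records     -- list of dicts: {"id": str, "items": {item name: value or None}}
    valid_ids   -- set of IDs that are allowed to appear
    item_ranges -- dict of item name -> (lowest, highest) permitted value
    max_missing -- booklets with more missing items than this are flagged
    """
    report = {"invalid_id": [], "invalid_value": [], "too_many_missing": []}
    for record in records:
        record_id = record["id"]
        if record_id not in valid_ids:
            report["invalid_id"].append(record_id)

        missing = 0
        for name, (low, high) in item_ranges.items():
            value = record["items"].get(name)
            if value is None:
                missing += 1
            elif not (low <= value <= high):
                report["invalid_value"].append((record_id, name, value))
        if missing > max_missing:
            report["too_many_missing"].append(record_id)
    return report


records = [
    {"id": "10011", "items": {"q1": 2, "q2": 5}},        # passes every check
    {"id": "99999", "items": {"q1": 7, "q2": None}},     # bad ID and bad value
    {"id": "10012", "items": {"q1": None, "q2": None}},  # mostly missing
]
print(bulk_check(records, {"10011", "10012"},
                 {"q1": (0, 4), "q2": (1, 5)}, max_missing=1))
```

The first record passes even if the value 2 was mis-keyed for a 3 on paper, which is the limitation noted above.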
Estimating Accuracy
As part of quality control during the data entry process, an attempt is usually made to check a random sample of booklets/questionnaires and to measure the accuracy of data entry. For each sampled booklet, the responses recorded on paper are compared, item-by-item, with the entered responses in the electronic file; a record is kept of the number of errors, and the types of errors (at the same time, the errors can be corrected). By comparing the number of errors with the number of items in a booklet, a percentage error rate can then be calculated. The table below summarises the results from some of these checks in TEDS studies. (Results have not been retained for all studies, and it is not clear whether such checks were done for the early booklet studies.)
|Study/booklet||Data entry method||No. items per booklet||No. booklets checked||No. errors found||% error rate (per item)|
|14 Year twin booklet||Manual keying||142||60||20||0.23%|
|14 Year parent booklet||Optical scanning||215||3||0||0.00%|
|12 Year twin booklet||Manual keying||110||20||1||0.05%|
|12 Year teacher qnrs||Manual keying||90||49||11||0.25%|
|12 Year parent booklets||Manual keying||260||30||48||0.62% *|
|12 Year parent booklets||Optical scanning||260||20||4||0.08% *|
|7 Year parent booklets - numeric data (dates, heights and weights)||Optical scanning||14||100||31||2.21% **|
|7 Year parent booklets - tick boxes||Optical scanning||360||10||2||0.06%|
|7 Year twin score sheets - numeric data (dates, TOWRE scores)||Optical scanning||12||100||8||0.67% **|
|7 Year twin score sheets - tick boxes||Optical scanning||150||10||7||0.47%|
* As a result of these checks, the data entry method was changed from manual keying to optical scanning for the 12 year parent booklets.
** As a result of these high error rates, extensive additional checks were carried out for these data items in all paper copies.
As a rough rule of thumb, based on the levels of accuracy that are achievable in practice, an average error rate of less than 0.5% per item is considered to be acceptable. In other words, the accuracy can be thought of as acceptable if there are, on average, fewer than 5 errors per 1000 items entered. The table above shows that higher levels of accuracy (with an error rate per item of less than 0.1%) can sometimes be achieved using optical scanning if the data are recorded using tick boxes.
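The per-item error rate is the number of errors found divided by the total number of items checked (items per booklet multiplied by booklets checked). A short Python sketch, using the two rows for the 12 year parent booklets from the table above, shows the calculation and the comparison with the 0.5% rule of thumb; the function and constant names are illustrative.

```python
ACCEPTABLE_PERCENT = 0.5  # rule-of-thumb ceiling for the per-item error rate


def error_rate_percent(items_per_booklet, booklets_checked, errors_found):
    """Percentage of checked items that were found to be entered incorrectly."""
    items_checked = items_per_booklet * booklets_checked
    return 100.0 * errors_found / items_checked


# 12 year parent booklets: 260 items per booklet.
manual = error_rate_percent(260, 30, 48)   # manual keying: 48 errors in 30 booklets
scanned = error_rate_percent(260, 20, 4)   # optical scanning: 4 errors in 20 booklets

for label, rate in [("manual keying", manual), ("optical scanning", scanned)]:
    verdict = "acceptable" if rate < ACCEPTABLE_PERCENT else "not acceptable"
    print(f"{label}: {rate:.2f}% ({verdict})")
```

With these figures the manual keying rate (48 errors in 7,800 items) is above the threshold and the scanning rate (4 errors in 5,200 items) is well below it.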
During manual keying of data, occasional typographical errors are inevitable. Occasionally, if the operator skips an item on the page or in the entered data, a sequence of consecutive errors can occur. This was observed to be a relatively common problem in the 12 year parent booklet, where there were pages containing long lists of similar items. While it is possible to prevent invalid responses from being keyed in, it is difficult to prevent incorrect but valid responses from being keyed in.
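When a sampled booklet is compared with the entered data, a skipped item tends to show up as a long run of consecutive mismatches rather than as isolated errors. The following Python sketch flags such runs; it is an illustrative aid, not a procedure documented for TEDS, and the function name, the minimum run length and the example values are assumptions.

```python
def mismatch_runs(paper, entered, min_length=3):
    """Find runs of consecutive items where the entered value differs from paper.

    paper, entered -- sequences of item values in booklet order
    min_length     -- shortest run that is reported
    Returns a list of (start_index, run_length) tuples, with 0-based indices.
    """
    runs = []
    start = None
    compared = min(len(paper), len(entered))
    for i in range(compared):
        if paper[i] != entered[i]:
            if start is None:
                start = i          # a new run of mismatches begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - start))
            start = None
    # A run may continue to the end of the booklet.
    if start is not None and compared - start >= min_length:
        runs.append((start, compared - start))
    return runs


paper = [1, 2, 3, 1, 4, 2, 5, 3]
# The operator skipped the third item, so every later value is shifted by one.
entered = [1, 2, 1, 4, 2, 5, 3, None]
print(mismatch_runs(paper, entered))  # [(2, 6)]
```

A run like this points to a single slip affecting many items, so re-keying the page is likely to be quicker than correcting each item separately. Neighbouring items that happen to share the same value can break a run up, so short runs close together deserve the same attention.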
Typographical errors may also occur in scanned data, for two reasons. Firstly, there are some items (including IDs, and free text data) that are not scanned but are typed by the scan operator from a screen. Secondly, where a scan problem occurs (such as two boxes ticked where only one is allowed), the item appears on screen and the operator must judge the correct response and type it in.
Errors can also arise in optical scanning in the following scenarios: (1) a tick/cross is recorded very faintly, e.g. using pencil or coloured pen, and is not detected by the scanner; (2) a tick/cross has carelessly been made just outside the borders of the box, so is not detected; (3) a random mark inside a box, e.g. from a scribble or careless pen stroke, can be recorded as a tick where none really exists. These types of errors have fortunately been rare in the TEDS parent and teacher booklets. However, they were more common in the 9 year twin booklet, as a result of which subsequent twin booklets (at 12 and 14) were manually entered instead of scanned.