Mass Data Massage: An automated data processing system used for NHEXAS, Arizona

Mary Kay O'Rourke, Luis M. Fernandez, Clinton N. Bittel, Jared L. Sherrill, Tony S. Blackwell, D. Royce Robbins

Research output: Contribution to journal › Article › peer-review

12 Scopus citations


Data entry and management are critical components of all large survey projects; data quality objectives must be met and data must be quickly and readily accessible. We developed a comprehensive system for data entry and management utilizing scannable forms with bubble fields and handwriting recognition. This 'Mass Data Massage' (MDM) system had three components: (1) form creation and database definition; (2) programming of data dictionaries for documentation and preliminary logic and range checks; and (3) data entry, management and documentation using the 'Mass Data Cleaning Program' (MDCP). Scannable forms were written in Teleform, where the data field definitions, variable names and ranges were defined as the form was created. Completed forms were returned from the field, subjected to final field quality control (QC) checks, and transferred to the data management section. They were batched and coded as necessary. Once a batch of data was scanned and visually verified, the operator called up the menu for the MDCP. The MDCP had 31 program modules with 500-1200 lines of code each. The operator could select and run the appropriate dictionary on each data batch, 'correcting' apparent errors in responses. This process was iterative until the data batch passed all dictionary checks. Proposed 'changes' were forwarded to the data coordinator (DC) for acceptance or rejection. After all errors had been resolved, each data batch was subjected to a 10% quality assurance (QA) check. The original data batch and associated file of applied changes were archived. Time expenditure using the scanning approach varied with the number of questions and the types of responses (handwritten or bubble fields). One-page forms took 42-60% of the time needed for hand entry; forms longer than 10 pages took 35-38% of the time. Use of faster machines will further speed the process. The main advantage of the system was the reduction of systematic errors.
Scanning alone reduced errors found on 995 NHEXAS Baseline Questionnaires. Overall, the dictionary identified 0.55% errors on the scanned forms. Ten percent QC checks, performed on corrected batches ready for appendage to the master database, revealed an overall error rate of 0.02%. Similar checks on a laboratory form scanned from numeric handwriting detected 0.3% errors following dictionary application and 0.2% errors during the 10% QA check. This system was faster, more accurate, and more cost-effective than hand entry of data. A batch of data that took >1 week to process using the hand entry method was processed within 1 day using MDM. Human coding of specific answers and the final verification were the most time-consuming processes.
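
The abstract does not describe how the MDCP was implemented, so the following Python is only an illustrative sketch of the workflow it outlines: a data dictionary of range checks run against a batch, proposed changes applied to a copy while the original batch is kept for archiving, and a 10% sample drawn for the QA check. Every name here (`DATA_DICTIONARY`, `check_batch`, `apply_changes`, `qa_sample`, and the example variables) is hypothetical, and drawing the QA sample at random is an assumption rather than something the abstract states.

```python
import random

# Hypothetical data dictionary: variable name -> (minimum, maximum) allowed value.
DATA_DICTIONARY = {
    "age_years": (0, 120),
    "household_size": (1, 20),
}


def check_batch(batch, dictionary):
    """Return (record_index, variable, value) for each missing or out-of-range value."""
    flagged = []
    for index, record in enumerate(batch):
        for variable, (low, high) in dictionary.items():
            value = record.get(variable)
            if value is None or not (low <= value <= high):
                flagged.append((index, variable, value))
    return flagged


def apply_changes(batch, changes):
    """Apply accepted (record_index, variable, new_value) changes to a copy of the batch.

    The original batch is left untouched so that it can be archived
    together with the list of applied changes.
    """
    corrected = [dict(record) for record in batch]
    for index, variable, new_value in changes:
        corrected[index][variable] = new_value
    return corrected


def qa_sample(batch, fraction=0.10, seed=0):
    """Pick record indices for a manual QA check (random sampling is an assumption)."""
    if not batch:
        return []
    size = min(len(batch), max(1, round(len(batch) * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(batch)), size))
```

In use, an operator would call `check_batch`, forward proposed corrections for acceptance, call `apply_changes` with the accepted ones, and repeat until `check_batch` returns an empty list; `qa_sample` would then select the records to re-verify by hand.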

Original language: English (US)
Pages (from-to): 471-484
Number of pages: 14
Journal: Journal of Exposure Analysis and Environmental Epidemiology
Issue number: 5
State: Published - 1999
Externally published: Yes


  • Automated data entry
  • Data Quality Assurance Program
  • Data dictionaries
  • Data management
  • Epidemiology
  • Handwriting recognition
  • Scanning data

ASJC Scopus subject areas

  • Environmental Chemistry
  • Toxicology
  • General Environmental Science
  • Pollution
  • Public Health, Environmental and Occupational Health
  • Health, Toxicology and Mutagenesis

