Automated Data Extraction from an Integrated Electronic Health Records Data Registry Yields Higher Accuracy Than Manual Entry: Results from the Registry for Stones of the Kidney and Ureter (ReSKU)



Electronic health record (EHR) systems are now the standard for a urologist’s documentation of daily patient care encounters. Recently, large data registries have emerged to incorporate important clinical information for quality initiatives. Big data plays a growing and integral role in clinical care, as supported by initiatives such as the American Urological Association’s Quality initiative (AQUA). Efficient processes for vetting the quality of these data will be critical. The primary objective of this study was to calculate the accuracy of data automatically extracted from EHRs compared with manually entered data.


As part of the Registry for Stones of the Kidney and Ureter (ReSKU), data are entered into an EHR (Epic) and then stored in a searchable, extractable fashion. Since 2015 at UCSF, all patient clinical data have simultaneously been collected and entered manually into a HIPAA-compliant REDCap database. This allowed comparison of the automated data extraction against a manual standard to identify input errors. Logistic regression was used to assess the association between discrepancies and data type (e.g., free-text or multiple choice), clinical encounter type, and patient medical record number.
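The comparison of a manual entry against its automated counterpart can be sketched as a field-by-field match. This is a minimal illustration only, not the study's actual pipeline: the field names, the example values, and the whitespace and case normalization are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# A record maps a field name to its entered value (hypothetical structure).
Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(value.strip().lower().split())


def find_discrepancies(manual: Record, automated: Record) -> List[Tuple[str, str, str]]:
    """Return (field, manual_value, automated_value) for every field whose values differ."""
    discrepancies = []
    # Compare the union of fields so a value missing on one side is also flagged.
    for field in sorted(set(manual) | set(automated)):
        manual_value = normalize(manual.get(field, ""))
        automated_value = normalize(automated.get(field, ""))
        if manual_value != automated_value:
            discrepancies.append((field, manual_value, automated_value))
    return discrepancies


# Hypothetical example records; these are not data from the registry.
manual_entry = {"stone_side": "Left", "prior_surgery": "No", "notes": "Flank pain x2 days"}
automated_entry = {"stone_side": "left ", "prior_surgery": "Yes", "notes": "flank pain for 2 days"}

for field, m, a in find_discrepancies(manual_entry, automated_entry):
    print(f"{field}: manual={m!r} automated={a!r}")
```

In this toy example the multiple-choice field `stone_side` matches after normalization, while the free-text `notes` field differs in wording alone, which illustrates why free-text data may be more prone to being flagged as discrepant.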


Data for 149 patients were entered into the ReSKU database both manually and via automated extraction. Matching manual entries with digital entries revealed 2,859 discrepancies among a total of 67,441 data points across four types of clinical encounters, for a discrepancy rate of 4%. Data stored as free-text had 2.1 times the odds of being discrepant compared with single-answer-multiple-choice (SAM) data (95% CI 1.97, 2.5). When free-text responses were excluded, the accuracy of automated extraction increased from 96% to 97%. Compared with data from the new patient encounter, data from the post-operative clinical encounter had 3.6 times the odds of containing discrepancies (95% CI 3.0, 4.3).
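The overall discrepancy rate and accuracy follow directly from the two reported counts. The short sketch below reproduces that arithmetic; the function name is introduced here for illustration only.

```python
def discrepancy_rate(discrepancies: int, data_points: int) -> float:
    """Fraction of compared data points on which the two entries disagree."""
    if data_points <= 0:
        raise ValueError("data_points must be positive")
    return discrepancies / data_points


# Counts reported in the abstract.
rate = discrepancy_rate(2859, 67441)
accuracy = 1 - rate

print(f"discrepancy rate: {rate:.1%}")  # prints "discrepancy rate: 4.2%"
print(f"accuracy: {accuracy:.0%}")      # prints "accuracy: 96%"
```

The unrounded rate is about 4.2%, which the abstract reports as 4%, and the complementary accuracy rounds to the stated 96%.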


Automated extraction of data from EHRs with high accuracy is possible, even across multiple clinical encounters and multiple data types. Reducing free-response entry should be considered when designing integrated database systems. As databases grow in complexity and depth, it will be important to ensure that they do not sacrifice accuracy.

Funding: None