Article Title

External Validity of Electronic Health Record Studies of Cancer Patients

Publication Date



cancer, observational studies, demographics, racial/ethnic differences in health and health care, epidemiology


Background: Electronic health records (EHRs) from academic and community-based health care systems are increasingly used for epidemiologic and health services research. The external validity of study findings is often unreported, and some question how representative the patient populations are. We evaluate how well Sutter Health cancer patients represent the underlying regional cancer patient population and assess potential bias in cancer research based on available EHR data.

Methods: We linked the patient population of Sutter Health, a large, multispecialty health care delivery system in Northern California, with the statewide, population-based California Cancer Registry and compared the distributions of demographic, socioeconomic and cancer characteristics for two groups: 1) all Sutter Health members diagnosed with cancer in 2012–2013, and 2) all cancer patients in the 17-county Sutter Health catchment region for the same period. To evaluate potential bias of EHR data, we also conducted a validation study that compared those characteristics between Sutter patients who had cancer-related charges or encounters in the Sutter EHR system and all cancer patients in the catchment region, also for 2012–2013.

Results: Of the 69,344 cancer patients diagnosed from 2012 to 2013 in the catchment region, 43.1% were Sutter patients. Compared with all regional cancer patients, Sutter’s population had proportionally more non-Hispanic whites (70.5% vs 65.2%), had slightly more breast cancer patients (32.9% vs 29.6%) and was more likely to have Medicare (37.2% vs 28.0%), but the two groups were similar in terms of age, gender, socioeconomic status, tumor stage and treatment types. Our validation study showed that 28.5% of the 69,344 cancer patients diagnosed during 2012–2013 in the catchment region were Sutter patients with EHR information available. The distributions for these Sutter patients with EHR information were more comparable to those of the underlying cancer patient population, with two exceptions: payer source, as they were more likely to have Medicare (38.8% vs 28.8%), and race/ethnicity, as they included proportionally more non-Hispanic whites (71.1% vs 65.2%).
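The arithmetic behind these figures can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function and variable names are hypothetical, and the head counts it produces are approximations implied by the rounded percentages, not values reported in the abstract.

```python
# Figures quoted in the Results paragraph; all names here are illustrative.
REGION_TOTAL = 69_344      # cancer patients diagnosed 2012-2013 in the catchment region
SHARE_SUTTER = 0.431       # share who were Sutter patients
SHARE_SUTTER_EHR = 0.285   # share who were Sutter patients with EHR information


def implied_count(total: int, share: float) -> int:
    """Approximate head count implied by a rounded percentage."""
    return round(total * share)


def point_gap(sutter_pct: float, region_pct: float) -> float:
    """Difference in percentage points (Sutter minus region), to one decimal."""
    return round(sutter_pct - region_pct, 1)


# (Sutter %, regional %) pairs from the first comparison in Results.
comparisons = {
    "non-Hispanic white": (70.5, 65.2),
    "breast cancer": (32.9, 29.6),
    "Medicare": (37.2, 28.0),
}
gaps = {name: point_gap(*pair) for name, pair in comparisons.items()}

if __name__ == "__main__":
    print("Implied Sutter patients:", implied_count(REGION_TOTAL, SHARE_SUTTER))
    print("Implied Sutter patients with EHR data:",
          implied_count(REGION_TOTAL, SHARE_SUTTER_EHR))
    for name, gap in gaps.items():
        print(f"{name}: {gap:+.1f} percentage points")
```

Expressing the differences in percentage points makes it easier to see that payer source shows the largest gap among the three characteristics listed.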

Conclusion: Research based on EHRs from single or integrated health care systems has unknown generalizability. We found that cancer patients from Sutter Health are generally representative of the underlying population; thus, cancer research based on Sutter EHR data can provide good external validity while minimizing potential biases.