Article Title

Data-Driven Modeling of Electronic Health Record Data to Predict Prediagnostic Heart Failure in Primary Care


Keywords

electronic health record, machine learning


Background/Aims: Electronic health records (EHRs) represent an opportunity for real-time early risk prediction. EHR data were used to determine the extent to which machine learning tools and a purely data-driven approach to modeling could detect heart failure (HF) and its subtypes, i.e. HF with preserved ejection fraction (HFpEF) and HF with reduced ejection fraction (HFrEF), 12 months before a clinical diagnosis.

Methods: Incident HF cases were identified among Geisinger Clinic primary care patients aged 50–85 years who were diagnosed between 2001 and 2010. Cases were further classified as HFrEF if left ventricular ejection fraction was ≤ 40% and as HFpEF if it was > 50%. Controls were matched to HF cases by age, gender, location and primary care physician. EHR data were extracted on demographics, ICD-9 codes, medication orders, and clinical and behavioral measures. Models were built to detect HF using data from a 24-month observation window ending 12 months before the HF diagnosis. Patient feature vectors were generated from these data, with each feature summarized by one or more aggregation functions (e.g. counts, means). For the HF endpoint, modeling was done with and without patients who had an acute coronary syndrome event within 12 months of the diagnosis. Regularized logistic regression was applied with information gain feature selection and 10-fold cross-validation. Model performance was assessed by the area under the receiver operating characteristic curve, and model complexity by the number of selected features.
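
As an illustration of the windowing and aggregation step described in the Methods, the following is a minimal Python sketch, not the study's actual pipeline. It assumes the 24-month observation window ends 12 months before the diagnosis date and that each EHR record is a (date, feature name, value) tuple. The function names, the record layout, the feature naming scheme and the example events are all hypothetical. How index dates were assigned to matched controls is not stated in the abstract and is not modeled here.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def shift_months(d, months):
    """Shift a date by whole months; the day is clamped to 28 (a simplification)."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(d.day, 28))


def observation_window(index_date, gap_months=12, window_months=24):
    """Return (start, end) of the window that ends gap_months before index_date."""
    end = shift_months(index_date, -gap_months)
    start = shift_months(end, -window_months)
    return start, end


def build_feature_vector(events, index_date):
    """Aggregate events inside the observation window into named features.

    events: iterable of (event_date, feature_name, value); value is None for
    code-like events (e.g. a diagnosis code) and a number for measurements.
    """
    start, end = observation_window(index_date)
    grouped = defaultdict(list)
    for event_date, name, value in events:
        if start <= event_date < end:  # drop events outside the window
            grouped[name].append(value)

    features = {}
    for name, values in grouped.items():
        features[name + "__count"] = len(values)  # count aggregation
        numeric = [v for v in values if v is not None]
        if numeric:
            features[name + "__mean"] = mean(numeric)  # mean aggregation
    return features


# Made-up events for one hypothetical case diagnosed on 2008-06-15.
events = [
    (date(2004, 1, 1), "icd9:401.9", None),     # before the window: excluded
    (date(2006, 1, 10), "icd9:401.9", None),
    (date(2006, 9, 2), "icd9:401.9", None),
    (date(2006, 3, 5), "systolic_bp", 150.0),
    (date(2007, 1, 20), "systolic_bp", 140.0),
    (date(2007, 12, 1), "systolic_bp", 170.0),  # inside the 12-month gap: excluded
]
features = build_feature_vector(events, date(2008, 6, 15))
print(features)
```

In this sketch, only events dated from 2005-06-15 up to (but not including) 2007-06-15 contribute, so the blood pressure reading taken six months before diagnosis does not leak into the feature vector. The resulting dictionaries could then be passed to a feature selection step and a regularized logistic regression, which are outside the scope of this example.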

Results: Model performance was better for HFpEF than for HFrEF. The HFpEF model was also more complex than the HFrEF model, as indicated by the greater amount of EHR information needed to discriminate HFpEF cases from controls. Performance with and without acute coronary syndrome cases was similar, though models that included acute coronary syndrome cases were more complex than models that excluded them.

Conclusion: Purely data-driven approaches to modeling can be used to detect HF 12 months before clinical diagnosis. Model performance and complexity vary with the HF subtype, indicating that the subtypes differ in how difficult they are to model.