Early Detection of Heart Failure Using Electronic Health Records: Practical Implications for Time Before Diagnosis, Data Diversity, Data Quantity, and Data Density


Keywords: cardiovascular disease, observational studies, primary care, chronic disease


Background: Using electronic health record (EHR) data to predict clinical events and disease onset is increasingly common. However, relatively little is known about the tradeoffs between data requirements and model utility.

Methods: We examined the performance of machine learning models trained to detect prediagnostic heart failure in primary care patients using longitudinal EHR data. Model performance was assessed in relation to data requirements defined by the prediction window length (time before clinical diagnosis), the observation window length (duration of observation preceding the prediction window), the number of distinct data domains (data diversity), the number of patient records in the training data set (data quantity), and the density of patient encounters (data density). A total of 1,684 incident heart failure cases and 13,525 gender-, age-category-, and clinic-matched controls were used for modeling.
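The windowing scheme described above can be illustrated with a minimal sketch. The function name, the parameter defaults, and the 365-day year approximation are assumptions for illustration only, not details taken from the study:

```python
from datetime import date, timedelta

def split_windows(encounters, diagnosis_date,
                  prediction_years=1, observation_years=2):
    """Partition a patient's encounter dates into an observation window
    (used as model input) and a prediction window (the period immediately
    before clinical diagnosis, within which the model must detect the case)."""
    pred_start = diagnosis_date - timedelta(days=365 * prediction_years)
    obs_start = pred_start - timedelta(days=365 * observation_years)
    observation = [d for d in encounters if obs_start <= d < pred_start]
    prediction = [d for d in encounters if pred_start <= d < diagnosis_date]
    return observation, prediction

# Hypothetical patient: quarterly encounters before a 2015-06-01 diagnosis.
encs = [date(2011, 1, 1) + timedelta(days=90 * i) for i in range(18)]
obs, pred = split_windows(encs, date(2015, 6, 1))
```

Only the encounters in the observation window would contribute features to the model; shrinking `prediction_years` moves the input data closer to diagnosis, which is the tradeoff examined in the Results.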

Results: Model performance improved as: 1) the prediction window length decreased, especially below 2 years; 2) the observation window length increased, leveling off after 2 years; 3) the training data set size increased, leveling off after 4,000 patients; 4) more diverse data types were used, with the combination of diagnosis, medication order, and hospitalization data (in that order) being most important; and 5) data were confined to patients with 10 or more phone or face-to-face encounters in 2 years.
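The data-density criterion in point 5 amounts to a cohort filter. The sketch below is a hypothetical helper, not the study's actual implementation; only the 10-encounter threshold and the 2-year window come from the abstract:

```python
from datetime import date, timedelta

def meets_density(encounters, window_end, min_encounters=10, window_days=730):
    """Return True if the patient had at least `min_encounters` phone or
    face-to-face encounters in the `window_days` (~2 years) ending at
    `window_end`."""
    window_start = window_end - timedelta(days=window_days)
    n = sum(1 for d in encounters if window_start <= d < window_end)
    return n >= min_encounters

# Hypothetical two-patient cohort: keep only sufficiently dense records.
cohort = {
    "dense":  [date(2014, 1, 1) + timedelta(days=60 * i) for i in range(12)],
    "sparse": [date(2014, 1, 1) + timedelta(days=300 * i) for i in range(3)],
}
kept = [pid for pid, encs in cohort.items()
        if meets_density(encs, window_end=date(2016, 1, 1))]
```

Restricting training to patients who pass such a filter trades cohort size for richer per-patient histories, which the Results indicate improves model performance.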

Conclusion: These empirical findings suggest practical guidelines for the minimum amount and types of data needed to train effective disease-onset prediction models from longitudinal EHR data.