Deafening Silence: When Is Data That’s Not There, Missing?


Keywords: information technology, virtual data warehouse


Background: The virtual data warehouse group has done good work describing subpopulations with known-compromised data capture. This information is now programmatically available to projects in the form of enrollment variables and has been evaluated once thus far, largely favorably. But that work has been largely descriptive: data-capture rates have been calculated and evaluated against the weak criterion that rates in the known-compromised group be graphically distinguishable from the rest. A next step for this work is to develop informed expectations about what those rates should be, so that capture can be evaluated against those benchmarks. This poster describes one small step in that direction.

Methods: We developed a crude model predicting the number of ambulatory visits (AVs) a given enrollee should have on the basis of age and sex. We then used this model to generate predicted visit counts for several groups of enrollees and compared them to the observed counts. We gathered annual AV count data on a cohort of approximately 200,000 people enrolled in 2013 at Group Health Cooperative, whose data capture was most reliable. Like much count data, these counts are highly skewed. Rather than attempt a sophisticated statistical model, we used median counts by sex and age category. This should generate very conservative predictions, because it effectively ignores the numerous "frequent flyer" patients whose visit counts lie well above the median. We then used that model to predict the total number of visits for each of several groups of enrollees in each of the 48 months between 2012 and 2015, and calculated the actual AV counts for comparison.
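The prediction scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual study code: the record layout (per-enrollee tuples of sex, age category, and annual AV count) and the function names are hypothetical, and the real analysis would stratify by month and group as described.

```python
from collections import defaultdict
from statistics import median

def fit_median_model(training):
    """Compute the median annual AV count within each (sex, age-category)
    stratum of a training cohort with trusted data capture."""
    strata = defaultdict(list)
    for sex, age_cat, visits in training:
        strata[(sex, age_cat)].append(visits)
    return {key: median(counts) for key, counts in strata.items()}

def predict_total_visits(model, enrollees):
    """Predict a group's total AV count by summing each enrollee's
    stratum median (0 if the stratum was unseen in training)."""
    return sum(model.get((sex, age_cat), 0) for sex, age_cat in enrollees)

# Skewed toy stratum: one "frequent flyer" pulls the mean far above the
# median, so the median-based prediction is deliberately conservative.
training = [("F", "45-54", 0), ("F", "45-54", 2), ("F", "45-54", 30)]
model = fit_median_model(training)
predicted = predict_total_visits(model, [("F", "45-54"), ("F", "45-54")])
# median of [0, 2, 30] is 2, so two such enrollees predict 4 visits,
# even though the training mean per enrollee is over 10.
```

Comparing such conservative predictions to observed totals then flags any group whose observed count falls *below* the prediction, which is the signal of interest here.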

Results: In 4 of the 6 groups evaluated, predictions were indeed substantial underestimates of the observed number of AVs. In 2 smaller groups, however, predictions overestimated the total number of AVs. Given how conservative the predictions are, these 2 groups bear further investigation for sufficiency of data capture.

Conclusion: While study results are unfortunately ambiguous, the work does demonstrate the value of empirically developing a benchmark against which we can evaluate data capture. Doing so can help focus our attention in data quality improvement efforts.