Article Title

Natural Language Processing of the Unstructured Electronic Health Record Data Using Regular Expressions and SAS Hash Objects

Publication Date



natural language processing, hash tables


Background/Aims: Structured data in the electronic health record tell only part of the story of a patient’s health. Much of the information necessary for health assessment and treatment is located in provider notes. This information is vital for maintaining care quality and for use in research, but manual abstraction is costly and time-consuming. The goal of this study was to develop a natural language processing (NLP) system to extract health information from provider notes.

Methods: We implemented NLP techniques in SAS using hash objects. Our workflow uses Excel spreadsheets to manage the regular expression keywords. Keywords for disease identification, generic terms, family relationships, and negation are kept in separate spreadsheets, which are imported into SAS and loaded into hash tables. Each note is parsed into sentences, and standard SAS hash processing is used to find each token and its position in the sentence. If a disease keyword is found, the family, generic, and negation hash processes are run. We then use the token types and their distance from the disease keyword to identify possible instances of the condition.
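The workflow described in Methods can be illustrated with a minimal sketch. It is written in Python rather than SAS, with sets standing in for the SAS hash tables, so it is an analogy and not the authors' implementation. The keyword lists, the window size, and all function names are hypothetical; the study's keyword lists come from its spreadsheets and are not reproduced here. The generic-term table is omitted for brevity.

```python
import re

# Hypothetical keyword tables; the study loads its own lists from spreadsheets.
DISEASE = {"diabetes", "diabetic"}
NEGATION = {"no", "denies", "without", "negative"}
FAMILY = {"mother", "father", "sister", "brother", "family"}
WINDOW = 4  # assumed maximum token distance; not a value from the study


def split_sentences(note):
    """Split a note into sentences on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", note.strip()) if s]


def tokenize(sentence):
    """Return lowercase word tokens; the list index is the token position."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def near(tokens, position, table, window=WINDOW):
    """True if a token within `window` positions of `position` is in `table`."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    return any(tokens[i] in table for i in range(lo, hi) if i != position)


def find_mentions(note):
    """Return one record per disease keyword hit, with context flags."""
    mentions = []
    for sentence in split_sentences(note):
        tokens = tokenize(sentence)
        for pos, tok in enumerate(tokens):
            if tok in DISEASE:
                # Context lookups run only after a disease keyword is found.
                negated = near(tokens, pos, NEGATION)
                family = near(tokens, pos, FAMILY)
                mentions.append({
                    "sentence": sentence,
                    "keyword": tok,
                    "position": pos,
                    "negated": negated,
                    "family": family,
                    "possible_case": not negated and not family,
                })
    return mentions
```

For example, calling `find_mentions("Patient has diabetes. Mother had diabetes. Denies diabetes.")` yields three mentions, of which only the first is flagged as a possible case; the second is attributed to a family member and the third is negated. A fixed token window is a deliberately simple rule; how the study weighs distance is not specified in the abstract.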

Results: In a random sample of 1,000 patients, we found 195 patients with a diagnosis of diabetes in the structured data (23% with an ICD-9 diagnosis code only, 5% with a problem listing only, and 72% with both). NLP identified 11 additional cases that had neither a diagnosis code nor a problem listing. Overall, agreement between structured and unstructured data was substantial (kappa=0.66).
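For readers unfamiliar with the agreement statistic, the following Python sketch shows how Cohen's kappa is computed from a 2x2 table of two binary classifications, here structured data versus NLP. The abstract does not report the full table, so the function name and its parameters are illustrative, and any counts passed to it are hypothetical rather than study data.

```python
def cohen_kappa(both, structured_only, nlp_only, neither):
    """Cohen's kappa for two binary classifications of the same patients.

    both            -- flagged by structured data and by NLP
    structured_only -- flagged by structured data only
    nlp_only        -- flagged by NLP only
    neither         -- flagged by neither source
    """
    n = both + structured_only + nlp_only + neither
    if n == 0:
        raise ValueError("table is empty")

    # Observed agreement: share of patients on whom both sources agree.
    observed = (both + neither) / n

    # Agreement expected by chance, from each source's marginal rate.
    p_structured = (both + structured_only) / n
    p_nlp = (both + nlp_only) / n
    expected = p_structured * p_nlp + (1 - p_structured) * (1 - p_nlp)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

With hypothetical counts of 40, 10, 10, and 40, observed agreement is 0.8 and chance agreement is 0.5, giving a kappa of about 0.6.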

Discussion: Preliminary analysis indicates that NLP can identify additional disease burden in patients, even for well-documented conditions like diabetes. NLP is a cost-effective and efficient approach for collecting information from unstructured data. In a health care delivery setting, this approach offers the potential for better identification of patients’ health care needs. It also has promising implications for health research by reducing misclassification errors and improving case ascertainment. Additional analyses are planned for other conditions that may not be as well documented in structured data.