Article Title

A Lightweight Text Mining Tool for Multisite Research

Publication Date



natural language processing, information extraction


Background/Aims: Electronic medical records give researchers and care administrators access to more patient information. Much of that information, however, is available only as unstructured data in clinical notes, which limits its accessibility. Various natural language processing systems seek to extract information from clinical notes and present it as structured data, thereby increasing the data’s usability for research, care administration and quality assurance. Many of the available open-source tools, however, require nontrivial installation and configuration. In addition, customization, debugging and portability can be challenging for some users.

Methods: We developed a lightweight Python application that performs basic information extraction tasks but is easy to install, configure, customize and share. The application relies on a dictionary to extract content from unstructured data. The dictionary maps each concept (e.g., “hypoesthesia”) to the words and phrases that characterize how that concept would appear in clinical text (e.g., “decreased sensation” or “impaired sensation”). The application converts these words and phrases to regular expressions and searches the text for matching content, attempting to detect variations in spelling and to recognize when a word or phrase is qualified by negation, uncertainty, historical reference or a reference to someone other than the patient.
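The dictionary-to-regular-expression approach described above can be illustrated with a minimal sketch. This is not the application's actual code: the names `DICTIONARY`, `NEGATION_CUES`, `phrase_to_regex` and `extract` are hypothetical, the negation check is a simple look-back over the preceding words in the same sentence, and spelling variation is limited here to case and whitespace differences. Uncertainty, historical reference and non-patient reference are not handled.

```python
import re

# Hypothetical dictionary: each concept maps to phrases seen in clinical text.
DICTIONARY = {
    "hypoesthesia": ["decreased sensation", "impaired sensation"],
}

# Hypothetical negation cues (an assumption, not the application's list).
NEGATION_CUES = ("no", "not", "denies", "without")


def phrase_to_regex(phrase):
    """Turn a phrase into a case-insensitive regex tolerant of extra whitespace."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def extract(text, dictionary=DICTIONARY, window=4):
    """Return one record per match, flagged if a negation cue precedes it."""
    hits = []
    for concept, phrases in dictionary.items():
        for phrase in phrases:
            for match in phrase_to_regex(phrase).finditer(text):
                # Only look back within the current sentence or clause.
                preceding = re.split(r"[.;\n]", text[:match.start()])[-1]
                prior_words = re.findall(r"[a-z']+", preceding.lower())[-window:]
                negated = any(word in NEGATION_CUES for word in prior_words)
                hits.append({
                    "concept": concept,
                    "matched": match.group(0),
                    "negated": negated,
                })
    return hits


note = "Patient reports decreased sensation in left foot. Denies impaired  sensation in hands."
for hit in extract(note):
    print(hit)
```

In this sketch the first mention is reported as affirmed and the second as negated, because "Denies" falls within the look-back window of the second match.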

Results: We tested the application using a dictionary of 792 entries on a set of 205,748 clinical notes. The application completed the information extraction task in less than two hours. For comparison, we ran the more comprehensive Apache Clinical Text Analysis and Knowledge Extraction System (cTAKES) with the same dictionary on the same notes. We stopped cTAKES after 8.5 days, with 144,701 notes (68.9%) complete.

Discussion: The application provides a quick means of accessing unstructured data with minimal configuration and environment setup. Multiple sites can use the same algorithm simply by sharing the dictionary and a configuration file.
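As a rough illustration of how two sites might share a setup, the sketch below parses a dictionary and a configuration file that are both assumed to be JSON. The abstract does not specify the file formats, so the function name `parse_shared_setup` and the configuration keys `notes_path` and `negation_window` are hypothetical.

```python
import json


def parse_shared_setup(dictionary_text, config_text):
    """Parse and sanity-check a shared dictionary and configuration (both assumed JSON)."""
    dictionary = json.loads(dictionary_text)
    config = json.loads(config_text)
    # Each concept must map to a list of phrase strings.
    for concept, phrases in dictionary.items():
        if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
            raise ValueError(f"Concept {concept!r} must map to a list of phrases")
    return dictionary, config


# Hypothetical file contents that one site could send to another.
shared_dictionary = '{"hypoesthesia": ["decreased sensation", "impaired sensation"]}'
shared_config = '{"notes_path": "notes/", "negation_window": 4}'

dictionary, config = parse_shared_setup(shared_dictionary, shared_config)
print(sorted(dictionary), config["negation_window"])
```

Under these assumptions, each site would only need to change site-specific values such as `notes_path`, while the dictionary stays identical across sites.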