Identifying Marijuana, Nicotine and Electronic Cigarette References Within a Free-Text Data Field in the Electronic Medical Record Using SAS
marijuana, natural language processing
Background/Aims: Marijuana is becoming increasingly decriminalized and legalized across the country, with no sign of abatement. Electronic cigarettes are rapidly increasing in popularity. Nicotine is commonly used as an aid in reducing dependence on tobacco products. Data on these behaviors are most frequently recorded in free-text comment fields and are subject to variations in spelling, abbreviation and slang. There is a growing need to identify references to these behaviors in the medical record.
Methods: Variant references can be identified with simple SAS text-string functions. A few specific “short-hand” text strings were used to capture all variations. A “short-hand string” is a series of characters one would expect to find in a word, such as “mari” and “juan” in “marijuana.” Ten such strings were identified and tested for marijuana, six for electronic cigarettes and three for nicotine. When a string was found within a field, an attempt was made to capture the entire word in which it was embedded. The isolated candidate text was then reviewed manually to verify what each string contributed.
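The capture step described above can be sketched as follows. This is a minimal Python analogue, not the authors' SAS code: the function name, the tokenizing pattern and the sample notes are hypothetical, and only the two short-hand strings named in the abstract are used.

```python
import re

# Only "mari" and "juan" are named in the abstract; the full set of
# short-hand strings is not given, so this tuple is an assumption.
SHORTHAND = ("mari", "juan")


def candidate_words(comment, shorthand=SHORTHAND):
    """Return each whole word in `comment` that contains a short-hand string."""
    # Lowercase the field and split it into word-like tokens (hypothetical rule).
    words = re.findall(r"[a-z0-9'-]+", comment.lower())
    # Keep the entire word whenever any short-hand string is embedded in it.
    return [w for w in words if any(s in w for s in shorthand)]


# Hypothetical free-text comment fields.
notes = [
    "Pt reports occasional marijauna use",
    "smokes marihuana on weekends",
    "Seen at San Juan Hospital last year",
    "denies tobacco",
]

for note in notes:
    print(candidate_words(note))
```

Matching on substrings rather than on a fixed list of spellings is what lets unseen misspellings surface, at the cost of false positives such as the hospital name in the third note.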
Results: Over 370 variations and misspellings of “marijuana,” not including slang and cannabinoid references; over 25 variations of “electronic cigarettes”; and over 50 variations of “nicotine” have been identified. Some oversampling is expected. For example, “juan” might identify a proper noun such as San Juan Hospital; “mari” might identify Marinol, a synthetic cannabinoid but not marijuana in a conventional sense; and “huan” identified ma huang, the Chinese name for ephedra. However, isolating the candidate text strings reduced the volume of distinct strings requiring manual review to a manageable level.
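The reduction to distinct strings can be illustrated with a short Python sketch. The candidate words below are hypothetical, the function name is an assumption, and the tally is only one plausible way to build a review list. It is not taken from the authors' SAS implementation.

```python
from collections import Counter

# Hypothetical candidate words already isolated from free-text fields.
candidates = [
    "marijuana", "marijauna", "juan", "marinol",
    "marijuana", "huang", "marajuana",
]


def distinct_candidates(words):
    """Collapse candidate words to distinct strings with their frequencies."""
    counts = Counter(words)
    # Most frequent first, then alphabetical, so reviewers see each string once.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for word, n in distinct_candidates(candidates):
    print(word, n)
```

A reviewer then decides inclusion or exclusion once per distinct string, rather than once per record.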
Discussion: A library of known misspellings can be employed to identify variants, but specific short-hand strings will also identify new candidates for consideration. Overinclusiveness is expected owing to idiosyncratic text entry, but isolating the candidate text yields manageable lists for manual review when determining inclusion or exclusion of the results. Using this series of short-hand strings may obviate the need to maintain an ongoing compendium of misspellings and variations.
Folck BF, Waring SC, Mack CD, Cleveland C. Identifying Marijuana, Nicotine and Electronic Cigarette References Within a Free-Text Data Field in the Electronic Medical Record Using SAS. J Patient Cent Res Rev 2015;2:118. http://dx.doi.org/10.17294/2330-0698.1146