Data Analyst, Critical Path Institute, Foster City, California, United States
Background: Evaluating the effectiveness and safety of repurposed drugs is crucial to advancing drug development for diseases of high unmet clinical need. CURE-ID, a web-based platform, facilitates drug repurposing by crowdsourcing real-world experiences of off-label drug use from healthcare providers in the form of case reports. Topic modeling is an unsupervised natural language processing (NLP) technique that discovers abstract “topics” occurring in a collection of documents. We aim to use it to extract insights from clinician notes spanning diverse diseases, as a preliminary data filtering mechanism that identifies the relevant information to extract.
Methods: Topic modeling was applied to unstructured clinician notes and literature abstracts from over 300 diseases within the CURE-ID app, using Latent Dirichlet Allocation (LDA) and BERTopic. The data were tokenized and lemmatized in Python using the gensim library (for LDA) and the nltk library (for BERTopic). Results from the optimized models were compared qualitatively, by examining the topics, and quantitatively, using the CV coherence score, which measures the relatedness and interpretability of keywords within topics. The keywords obtained from both models were summarized using ChatGPT to produce topic names for all fields.
Results: An example of the qualitative comparison between topics generated by LDA and BERTopic is shown in Table 1. The mean coherence score was 0.41 (minimum 0.34, maximum 0.57) for LDA and 0.73 (minimum 0.66, maximum 0.77) for BERTopic. The results indicate that both models were able to generate meaningful topics, but BERTopic outperformed LDA. This can be attributed to BERTopic's ability to take the semantic meaning and context of the text into account when identifying topics, whereas LDA relies on a bag-of-words representation, which treats words as independent entities without context or semantic information.
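The bag-of-words limitation noted in the Results can be illustrated with a minimal sketch using only the Python standard library. It is not the pipeline used in this study: the `bag_of_words` helper and the two example notes are hypothetical, chosen only to show that a count-based representation discards word order.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, keep alphabetic tokens, and count them.

    Word order is discarded, which is the defining property of a
    bag-of-words representation. (Hypothetical helper for illustration.)
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two hypothetical notes that use the same words but describe opposite outcomes.
note_a = "Drug improved fever but not rash"
note_b = "Drug improved rash but not fever"

bow_a = bag_of_words(note_a)
bow_b = bag_of_words(note_b)

# The notes differ as strings, yet their bag-of-words counts are identical.
print(note_a == note_b)  # False
print(bow_a == bow_b)    # True
```

Because both notes map to the same counts, a model that sees only this representation cannot tell them apart, whereas an embedding-based approach that reads the full sentence has access to the word order.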
Conclusion: Topic modeling effectively identifies and organizes topics in unstructured data to focus on specific research questions. In this study, we successfully applied and compared two topic modeling approaches to extract relevant information from diverse unstructured clinician notes and literature for downstream drug repurposing applications.