Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications

  • Dibakar Sigdel
  • Vincent Kyi
  • Aiden Zhang
  • Shaun P. Setty
  • David A. Liem
  • Yu Shi
  • Xuan Wang
  • Jiaming Shen
  • Wei Wang
  • Jiawei Han
  • Peipei Ping

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid accumulation of biomedical textual data has far exceeded the human capacity for manual curation and analysis, necessitating novel text-mining tools to extract biological insights from large volumes of scientific reports. The Context-aware Semantic Online Analytical Processing (CaseOLAP) pipeline, developed in 2016, quantifies user-defined phrase-category relationships through the analysis of textual data, and it has many biomedical applications. We have developed a protocol for a cloud-based environment that supports an end-to-end phrase-mining and analysis platform. Our protocol includes data preprocessing (e.g., downloading, extracting, and parsing text documents), indexing and searching with Elasticsearch, creating a functional document structure called Text-Cube, and quantifying phrase-category relationships using the core CaseOLAP algorithm. Data preprocessing generates key-value mappings for all documents involved. The preprocessed data is then indexed so that documents mentioning the entities can be searched, which in turn facilitates Text-Cube creation and CaseOLAP score calculation. The resulting raw CaseOLAP scores are interpreted using a series of integrative analyses, including dimensionality reduction, clustering, temporal analysis, and geographical analysis. Additionally, the CaseOLAP scores are used to create a graphical database, which enables semantic mapping of the documents. CaseOLAP defines phrase-category relationships in a manner that is accurate (identifies relationships), consistent (highly reproducible), and efficient (processes 100,000 words/sec). Following this protocol, users can access a cloud-computing environment to support their own configurations and applications of CaseOLAP. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.
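
To make the Text-Cube and scoring steps described above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a Text-Cube can be approximated as a mapping from category to document IDs, and it substitutes a deliberately simple score (the share of a phrase's mentions that fall in each category) for the core CaseOLAP algorithm, whose formula is not given in this abstract. The function names, category labels, and sample documents are all hypothetical.

```python
from collections import Counter, defaultdict


def build_text_cube(doc_categories):
    """Group document IDs by category (a toy stand-in for a Text-Cube)."""
    cube = defaultdict(list)
    for doc_id, categories in doc_categories.items():
        for category in categories:
            cube[category].append(doc_id)
    return dict(cube)


def count_phrases(cube, documents, phrases):
    """Count case-insensitive occurrences of each phrase within each category."""
    counts = {}
    for category, doc_ids in cube.items():
        counter = Counter()
        for doc_id in doc_ids:
            text = documents[doc_id].lower()
            for phrase in phrases:
                counter[phrase] += text.count(phrase.lower())
        counts[category] = counter
    return counts


def toy_scores(counts):
    """Assumed placeholder score: fraction of a phrase's mentions per category.

    This is NOT the CaseOLAP score; it only illustrates the data flow.
    """
    totals = Counter()
    for counter in counts.values():
        totals.update(counter)
    return {
        category: {
            phrase: (count / totals[phrase] if totals[phrase] else 0.0)
            for phrase, count in counter.items()
        }
        for category, counter in counts.items()
    }


# Hypothetical key-value mappings, as preprocessing might produce them.
documents = {
    "d1": "Troponin levels rise after myocardial infarction.",
    "d2": "Troponin and myosin in cardiomyopathy.",
    "d3": "Myosin mutations in arrhythmia.",
}
doc_categories = {
    "d1": ["ischemia"],
    "d2": ["cardiomyopathy"],
    "d3": ["arrhythmia"],
}
phrases = ["troponin", "myosin"]

cube = build_text_cube(doc_categories)
scores = toy_scores(count_phrases(cube, documents, phrases))
```

In this sketch, `scores[category][phrase]` plays the role of a phrase-category association matrix; in the protocol itself, that matrix would hold CaseOLAP scores and feed the downstream clustering and dimensionality-reduction analyses.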

Original language: English (US)
Article number: e59108
Journal: Journal of Visualized Experiments
Volume: 2019
Issue number: 144
State: Published - Feb 2019
Externally published: Yes

Keywords

  • cloud computing
  • data science
  • Issue 144
  • medical informatics
  • Medicine
  • phrase mining
  • text mining

ASJC Scopus subject areas

  • General Neuroscience
  • General Chemical Engineering
  • General Immunology and Microbiology
  • General Biochemistry, Genetics and Molecular Biology
