Abstract
We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called “information extraction” as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The relevancy signatures algorithm uses linguistic phrases; the augmented relevancy signatures algorithm uses phrases and local context; and the case-based text classification algorithm uses larger pieces of context. Relevant phrases and contexts are acquired automatically using a training corpus. We evaluate the algorithms on the basis of two test sets from the MUC-4 corpus. All three algorithms achieved high precision on both test sets, with the augmented relevancy signatures algorithm and the case-based algorithm reaching 100% precision with over 60% recall on one set. Additionally, we compare the algorithms on a larger collection of 1700 texts and describe an automated method for empirically deriving appropriate threshold values. The results suggest that information extraction techniques can support high-precision text classification and, in general, that using more extracted information improves performance. As a practical matter, we also explain how the text classification system can be easily ported across domains.
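The abstract does not spell out how the algorithms work, so the following Python sketch only illustrates the general idea it describes: acquire relevant phrases from a labeled training corpus, keep those that clear empirically chosen thresholds, and measure precision and recall on held-out texts. The reliability and frequency thresholds, the function names, and the toy phrases are all assumptions made for illustration; this is not the authors' implementation, and plain strings stand in for the output of an information extraction system.

```python
from collections import Counter

def train_phrase_stats(training_texts):
    """training_texts: list of (phrases, is_relevant) pairs (hypothetical format).

    Returns {phrase: (reliability, frequency)}, where reliability is the
    fraction of training texts containing the phrase that are relevant.
    """
    total = Counter()
    relevant = Counter()
    for phrases, is_relevant in training_texts:
        for phrase in set(phrases):  # count each phrase once per text
            total[phrase] += 1
            if is_relevant:
                relevant[phrase] += 1
    return {p: (relevant[p] / total[p], total[p]) for p in total}

def select_phrases(stats, min_reliability, min_frequency):
    """Keep phrases that clear both (assumed) thresholds."""
    return {
        p for p, (reliability, frequency) in stats.items()
        if reliability >= min_reliability and frequency >= min_frequency
    }

def classify(phrases, selected):
    """Label a text relevant if it contains any selected phrase."""
    return any(p in selected for p in phrases)

def precision_recall(predictions, labels):
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, invented for this sketch.
training = [
    ({"was bombed", "was reported"}, True),
    ({"was bombed", "was killed"}, True),
    ({"was killed", "was reported"}, False),
    ({"was reported"}, False),
]
test_texts = [{"was bombed"}, {"was killed"}, {"was reported"}]
labels = [True, True, False]

stats = train_phrase_stats(training)
selected = select_phrases(stats, min_reliability=0.9, min_frequency=2)
predictions = [classify(t, selected) for t in test_texts]
precision, recall = precision_recall(predictions, labels)
print(selected, precision, recall)
```

Raising the reliability threshold in this sketch trades recall for precision, which is the kind of trade-off the abstract's threshold-derivation method is meant to tune empirically.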
| Original language | English (US) |
|---|---|
| Pages (from-to) | 296-333 |
| Number of pages | 38 |
| Journal | ACM Transactions on Information Systems |
| Volume | 12 |
| Issue number | 3 |
| State | Published - Jan 7 1994 |
| Externally published | Yes |
Keywords
- information extraction
- text classification
ASJC Scopus subject areas
- Information Systems
- General Business, Management and Accounting
- Computer Science Applications