MathAlign: Linking formula identifiers to their contextual natural language descriptions

Maria Alexeeva, Rebecca Sharp, Marco A. Valenzuela-Escárcega, Jennifer Kadowaki, Adarsh Pyarelal, Clayton Morrison

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

14 Scopus citations

Abstract

Extending machine reading approaches to extract mathematical concepts and their descriptions is useful for a variety of tasks, ranging from mathematical information retrieval to increasing accessibility of scientific documents for the visually impaired. This entails segmenting mathematical formulae into identifiers and linking them to their natural language descriptions. We propose a rule-based approach for this task, which extracts LaTeX representations of formula identifiers and links them to their in-text descriptions, given only the original PDF and the location of the formula of interest. We also present a novel evaluation dataset for this task, as well as the tool used to create it. The data and the source code are open source and are available at https://osf.io/bdxmr/ and https://github.com/ml4ai/automates, respectively.
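
The two steps named in the abstract, segmenting a formula into identifiers and linking each identifier to its in-text description, can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expressions, the operator list, and the function names `segment_identifiers` and `link_descriptions` are assumptions made for illustration only, and the sketch starts from a LaTeX string and plain text rather than from a PDF.

```python
import re

# Assumed pattern for a LaTeX identifier: a command (\alpha) or a single
# letter, optionally followed by one subscript (T_0, T_{max}).
IDENTIFIER = re.compile(
    r"\\[A-Za-z]+(?:_\{[^}]*\}|_[A-Za-z0-9])?"
    r"|[A-Za-z](?:_\{[^}]*\}|_[A-Za-z0-9])?"
)

# Commands treated as operators or layout rather than identifiers
# (illustrative, not exhaustive).
NON_IDENTIFIERS = {"\\frac", "\\cdot", "\\times", "\\left", "\\right",
                   "\\sqrt", "\\sum"}


def segment_identifiers(latex):
    """Return the distinct identifiers in a LaTeX formula, in order of appearance."""
    seen = []
    for match in IDENTIFIER.finditer(latex):
        token = match.group(0)
        base = token.split("_")[0]  # drop the subscript before the operator check
        if base in NON_IDENTIFIERS or token in seen:
            continue
        seen.append(token)
    return seen


def link_descriptions(identifiers, text):
    """Link identifiers to descriptions with one hand-written rule:
    '<identifier> is|denotes|represents [the|a|an] <description>'."""
    links = {}
    for ident in identifiers:
        pattern = re.compile(
            r"(?<![A-Za-z\\])" + re.escape(ident)      # not inside a longer word
            + r"\s+(?:is|denotes|represents)\s+(?:the\s+|an?\s+)?"
            + r"([^,.;]+?)(?=\s+and\s|[,.;]|$)"        # stop at punctuation or 'and'
        )
        found = pattern.search(text)
        if found:
            links[ident] = found.group(1).strip()
    return links


formula = r"T_{max} = \alpha \cdot T_0"
context = (r"where T_{max} is the maximum temperature, "
           r"\alpha denotes a scaling factor and T_0 is the baseline temperature.")
for identifier, description in link_descriptions(
        segment_identifiers(formula), context).items():
    print(identifier, "->", description)
```

A single surface rule like this one misses most real phrasings (appositives, descriptions that precede the identifier, descriptions far from the formula), which is the kind of gap a fuller rule set would need to cover.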

Original language: English (US)
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 2204-2212
Number of pages: 9
ISBN (Electronic): 9791095546344
State: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: May 11 2020 → May 16 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 5/11/20 → 5/16/20

Keywords

  • Corpus creation
  • Machine reading
  • Math information retrieval
  • Relation extraction
  • Tool creation

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
