Knowledge Discovery Group

Text Extraction from Scholarly Figures

Scholarly figures are data visualizations such as bar charts, pie charts, line graphs, maps, and scatter plots. Extracting text from scholarly figures is useful in many application scenarios, because the text in a figure often carries information that is not present in the surrounding text.

Design and Implementation

From an analysis of the broad body of research on text extraction from figures, we derived a generic pipeline and implemented more than 20 methods in total for its six sequential steps. A binary of our Java implementation can be downloaded here, as well as Tesseract's language data.

Our implementation is based on the streams framework and uses Tess4J to access the Tesseract OCR engine and PDFBox to process PDFs.
All libraries used in our implementation are provided separately as JAR files. They are published under GNU or Apache licenses (see the respective websites for more information).

We provide a set of possible configurations and an instruction manual:

  • Quick Start Guide to use our Pipeline
  • Documentation (v0.6) of the Pipeline Interfaces
  • Example Configurations

Datasets

We additionally provide four datasets for evaluation:

  • EconBiz: A corpus of 121 scholarly figures from the economics domain. We randomly extracted these figures from a corpus of 288,000 open access publications from EconBiz. The dataset covers a wide variety of scholarly figures, from bar charts to maps. We manually labeled the figures to create the gold standard.
  • DeGruyter: This dataset consists only of the ground truth information for 120 figures from academic books provided by DeGruyter. A ReadMe file is included that points to the images for which the ground truth was created. Most of the figures are from the chemistry domain.
  • CHIME-R: A set of 115 real images that were collected from the Internet or scanned from paper. Most of the figures are bar charts, with a few pie charts and line charts. The gold standard for this dataset was created by Yang Li.
  • CHIME-S: A set of 85 synthetically generated images. This set mainly contains line charts and pie charts, with a few bar charts. The gold standard was created by Zhao Jiuzhou.

The CHIME datasets were created by the Center for Information Mining and Extraction (CHIME), School of Computing, National University of Singapore and the original datasets can be found here.

We converted the provided gold standards to a uniform format for all datasets. Each figure is accompanied by a TSV (tab-separated values) file in which each entry corresponds to one text line and has the following structure:

  • X-coordinate of the center of the bounding box in pixels
  • Y-coordinate of the center of the bounding box in pixels
  • Width of the bounding box in pixels
  • Height of the bounding box in pixels
  • Rotation angle around its center in degrees
  • Text inside the bounding box
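
The structure above can be read with a few lines of code. The following is a minimal sketch in Python, not part of our Java implementation. It assumes that the six fields appear in exactly the order listed, that the files have no header row, and that a positive angle follows the standard 2D rotation matrix in image coordinates; the ReadMe of each dataset is the authority on these points. The names TextLine and parse_gold_standard are hypothetical and chosen only for illustration.

```python
import csv
import io
import math
from dataclasses import dataclass


@dataclass
class TextLine:
    cx: float      # x-coordinate of the box center, in pixels
    cy: float      # y-coordinate of the box center, in pixels
    width: float   # box width, in pixels
    height: float  # box height, in pixels
    angle: float   # rotation around the center, in degrees
    text: str      # text inside the box

    def corners(self):
        """Return the four corners of the rotated box as (x, y) tuples.

        Assumption: the angle is applied with the standard 2D rotation
        matrix; the actual direction convention is not confirmed here.
        """
        a = math.radians(self.angle)
        cos_a, sin_a = math.cos(a), math.sin(a)
        hw, hh = self.width / 2, self.height / 2
        offsets = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
        return [
            (self.cx + dx * cos_a - dy * sin_a,
             self.cy + dx * sin_a + dy * cos_a)
            for dx, dy in offsets
        ]


def parse_gold_standard(stream):
    """Parse tab-separated rows from a text stream into TextLine records."""
    # QUOTE_NONE keeps quotation marks in the figure text untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = []
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or malformed rows
        cx, cy, w, h, angle = (float(v) for v in row[:5])
        # Rejoin the remainder in case the text itself contains a tab.
        lines.append(TextLine(cx, cy, w, h, angle, "\t".join(row[5:])))
    return lines


# Two made-up rows for illustration; they are not taken from any dataset.
sample = io.StringIO("100\t50\t80\t20\t0\tRevenue\n30\t120\t60\t16\t90\tYear\n")
records = parse_gold_standard(sample)
```

To read a real file, pass an opened text file instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"); the UTF-8 encoding is likewise an assumption.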

In addition, we provide the ground truth in JSON format. A schema file is included in each dataset as well. Furthermore, each dataset is accompanied by a ReadMe file with further information about the figures and their origin.

Publications

Böschen, F. & Scherp, A.
Amsaleg, L.; Guðmundsson, G.; Gurrin, C.; Jónsson, B. & Satoh, S. (Eds.)
A Comparison of Approaches for Automated Text Extraction from Scholarly Figures
Proceedings Part I, Multimedia Modeling - 23rd International Conference MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Springer, 2017, 15-27
[RAW extraction results]

Böschen, F. & Scherp, A.
Bergmann, R.; Görg, S. & Müller, G. (Eds.)
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics
Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB, Trier, Germany, October 7-9, 2015, CEUR-WS.org, 2015, 1458, 20-31
[Fulltext]

Böschen, F. & Scherp, A.
Vanoirbeek, C. & Genevès, P. (Eds.)
Multi-oriented Text Extraction from Information Graphics
Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, September 8-11, 2015, ACM, 2015, 35-38
[Fulltext]
