Knowledge Discovery Group

Automatic Detection and Separation of Scholarly Figures from PDFs


PDF is one of the most common file types for sharing documents, and many scientific publications are published in this format. Although the documents present their content in a structured manner, their internal structure is highly variable and often unstructured. Some tools are capable of extracting embedded images. Scientific publications contain many scholarly figures, but some documents also include many other images due to formatting issues (whole-page scans, draft logos, etc.). In addition, some scholarly figures are of very low quality. Given large corpora totaling more than 300,000 documents, it is infeasible to extract the useful images by hand.

This thesis will develop a classification scheme for images extracted from scientific publications in order to separate high-quality scholarly figures from the other images. Several large-scale datasets of scientific publications, totaling more than 300,000 English-language PDFs, will be provided for evaluation purposes.

In more detail, the work should cover:
• Development of features for classifying images as high-quality scholarly figures or other images
• Application of classification methods to a large-scale dataset
• Evaluation on a suitable sample
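
To make the first two work items more concrete, the following is a minimal sketch of what feature extraction and a first filtering rule could look like. It is not a prescribed method for the thesis: the feature set, the names `Features`, `extract_features` and `looks_like_figure`, and all thresholds are illustrative assumptions, and images are represented as plain lists of grayscale values so that the example needs no external library.

```python
from dataclasses import dataclass
from typing import List

# Assumption: an image is a rectangular grid of grayscale values in 0..255.
Image = List[List[int]]


@dataclass
class Features:
    width: int
    height: int
    aspect_ratio: float    # width divided by height
    white_fraction: float  # share of near-white pixels
    distinct_levels: int   # number of different gray values


def extract_features(img: Image, white_threshold: int = 240) -> Features:
    """Compute a few simple, hypothetical features of one image."""
    pixels = [p for row in img for p in row]
    if not pixels:
        raise ValueError("empty image")
    height = len(img)
    width = len(img[0])
    white = sum(1 for p in pixels if p >= white_threshold)
    return Features(
        width=width,
        height=height,
        aspect_ratio=width / height,
        white_fraction=white / len(pixels),
        distinct_levels=len(set(pixels)),
    )


def looks_like_figure(f: Features, min_side: int = 100,
                      max_aspect: float = 4.0) -> bool:
    """Rule-based stand-in for a classifier; thresholds are untuned guesses."""
    if min(f.width, f.height) < min_side:
        return False  # very small images, e.g. icons or logos
    if max(f.aspect_ratio, 1 / f.aspect_ratio) > max_aspect:
        return False  # extremely elongated images, e.g. rules or banners
    if f.distinct_levels < 2:
        return False  # uniform image without any content
    return True


if __name__ == "__main__":
    # Synthetic 200x150 "plot": white background with one black line.
    figure = [[255] * 200 for _ in range(150)]
    figure[75] = [0] * 200
    print(looks_like_figure(extract_features(figure)))
```

The rule uses only some of the features; `white_fraction` is computed as an example of a feature that a trained classifier could take as input instead of hand-set thresholds.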


Requirements:
• Good programming skills
• Knowledge of image analysis techniques is an advantage
• Ability to manage large datasets is required

Interested? Send an email!

