Extraction of Statistical Metadata from Scientific Publications


Approaches for extracting metadata from scientific publications mostly concentrate on standard bibliographic information such as title, authors, venue, and publication year. Some work also aims at extracting structured data from publications, such as data published in tables. However, further details, such as the statistical results reported in a publication, are often not extracted.

In this thesis, we aim to develop a rule-based approach for extracting statistical analyses reported in scientific papers. The rules are based on standards for how statistical results should be reported in scientific papers. For the experiments, a large-scale dataset of more than 400,000 scientific publications in PDF format will be provided.
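To illustrate what such a rule could look like, here is a minimal sketch in Python. It assumes APA-style reporting (for example "t(23) = 2.45, p = .012") as one possible standard; the topic description does not name a specific standard, and the names `STAT_RULE`, `StatResult`, and `extract_statistics` are hypothetical. It also assumes the PDF text has already been converted to plain text.

```python
import re
from typing import List, NamedTuple, Tuple


class StatResult(NamedTuple):
    test: str                # test statistic symbol: "t", "F", or "r"
    df: Tuple[float, ...]    # one or two degrees of freedom
    value: float             # value of the test statistic
    p_relation: str          # "<", ">", or "="
    p_value: float           # reported p-value


# Hypothetical rule for APA-like reports such as "t(23) = 2.45, p = .012"
# or "F(2, 45) = 3.21, p < .05". Real papers vary far more than this
# (Unicode minus signs, line breaks, missing p-values, other tests).
STAT_RULE = re.compile(
    r"""
    (?P<test>\b[tFr])\s*                 # test statistic symbol
    \(\s*(?P<df1>\d+(?:\.\d+)?)          # first degrees of freedom
    (?:\s*,\s*(?P<df2>\d+(?:\.\d+)?))?   # optional second df (F-test)
    \s*\)\s*=\s*
    (?P<value>-?\d*\.?\d+)               # statistic value
    \s*,\s*p\s*(?P<rel>[<>=])\s*
    (?P<p>\d*\.\d+)                      # p-value, leading zero optional
    """,
    re.VERBOSE,
)


def extract_statistics(text: str) -> List[StatResult]:
    """Return all statistical results in `text` that match STAT_RULE."""
    results = []
    for m in STAT_RULE.finditer(text):
        df = tuple(float(d) for d in (m["df1"], m["df2"]) if d is not None)
        results.append(
            StatResult(m["test"], df, float(m["value"]), m["rel"], float(m["p"]))
        )
    return results


if __name__ == "__main__":
    sample = (
        "The effect was significant, t(23) = 2.45, p = .012, "
        "while the interaction was not, F(2, 45) = 3.21, p < .05."
    )
    for result in extract_statistics(sample):
        print(result)
```

A single regular expression like this covers only a few test types. A complete rule set would likely need one rule per reporting pattern, plus normalization of the text extracted from the PDFs.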

In more detail, the work should cover:
- Development of extraction methods for statistical analyses
- Application of the extraction methods to the large-scale dataset
- Evaluation on a suitable sample
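
One possible way to carry out the evaluation step is to compare the extracted results against a manually annotated sample and report precision, recall, and F1. The choice of these metrics is an assumption, since the topic description does not prescribe an evaluation measure. The function name `precision_recall_f1` and the representation of results as strings are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compare extracted results with a hand-annotated gold sample.

    Both sets hold normalized result strings, e.g. "t(23)=2.45,p=.012".
    """
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy identifiers standing in for normalized statistical results.
    extracted = {"a", "b", "c"}
    gold = {"a", "b", "d", "e"}
    print(precision_recall_f1(extracted, gold))
```

Exact string matching is the strictest option. A more lenient comparison, for instance one that tolerates rounding differences, may be more suitable depending on how the sample is annotated.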


Requirements:
- Good programming skills
- Knowledge of basic natural language processing techniques is an advantage
- Ability to manage large datasets is required

