Knowledge Discovery Group

Optimizing Caching and Crawling Strategies for Stream-based SchemEX Computation

Description

SchemEX is a stream-based approach to compute a schema index over Linked Open Data (LOD) [1]. The data stream is generated by an RDF crawler that harvests triples from the Semantic Web. So far, SchemEX uses a FIFO queue as a cache on the stream of RDF triples to extract schema information from the crawled resources. The strategy of the RDF crawler itself has not been considered so far.
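
To make the setting concrete, the following sketch illustrates the general idea of a bounded FIFO window over an incoming triple stream: triples are grouped by subject, and when the window overflows, the oldest subject is evicted and its collected schema information is emitted. This is only a minimal illustration under simplifying assumptions (only rdf:type statements are collected), not the actual SchemEX implementation; all class and method names are hypothetical.

    import java.util.*;

    /**
     * Minimal sketch (not the actual SchemEX code) of a bounded FIFO window
     * over a stream of RDF triples. Triples are grouped by subject; when the
     * window is full, the oldest subject is evicted and its collected schema
     * information (here simply its rdf:type objects) is emitted.
     */
    class FifoTripleWindow {
        private static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

        private final int capacity;   // maximum number of cached subjects
        private final LinkedHashMap<String, Set<String>> window = new LinkedHashMap<>();

        FifoTripleWindow(int capacity) { this.capacity = capacity; }

        /** Consume one triple from the stream. */
        void add(String subject, String predicate, String object) {
            window.computeIfAbsent(subject, s -> new HashSet<>());
            if (RDF_TYPE.equals(predicate)) {
                window.get(subject).add(object);      // collect type information
            }
            if (window.size() > capacity) {           // FIFO eviction of the oldest subject
                Map.Entry<String, Set<String>> oldest = window.entrySet().iterator().next();
                emitSchema(oldest.getKey(), oldest.getValue());
                window.remove(oldest.getKey());
            }
        }

        /** Placeholder for writing the extracted schema element to the index. */
        private void emitSchema(String subject, Set<String> types) {
            System.out.println(subject + " -> " + types);
        }
    }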

Different caching strategies on a given data stream influence the quality of the resulting schema index. Likewise, guiding the crawler, or providing a more suitable crawling strategy, may lead to better index quality. The task is to develop, implement, and evaluate different caching and crawling strategies in the SchemEX scenario.

In more detail, the work should cover:
- Development of caching strategies (see the pluggable-strategy sketch after this list)
- Development of crawling strategies/guidance
- Incorporation of the strategies in the existing system used for computing SchemEX
- Evaluation on a suitable corpus
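
As an illustration of how alternative caching strategies could be made exchangeable for the evaluation, the following sketch defines a small strategy interface with a FIFO and an LRU variant. The existing SchemEX system does not necessarily expose such an interface; all names are hypothetical assumptions, not part of the current code base.

    import java.util.*;

    /** Hypothetical interface for pluggable cache replacement strategies. */
    interface CachingStrategy {
        /** Record that a subject was touched; return the subject to evict, or null. */
        String onAccess(String subject);
    }

    /** FIFO: evict the subject that entered the cache first. */
    class FifoStrategy implements CachingStrategy {
        private final int capacity;
        private final LinkedHashSet<String> order = new LinkedHashSet<>();

        FifoStrategy(int capacity) { this.capacity = capacity; }

        public String onAccess(String subject) {
            order.add(subject);                       // no reordering on re-access
            if (order.size() > capacity) {
                String victim = order.iterator().next();
                order.remove(victim);
                return victim;
            }
            return null;
        }
    }

    /** LRU: evict the subject that has not been accessed for the longest time. */
    class LruStrategy implements CachingStrategy {
        private final int capacity;
        private final LinkedHashMap<String, Boolean> order =
            new LinkedHashMap<>(16, 0.75f, true);     // access-ordered map

        LruStrategy(int capacity) { this.capacity = capacity; }

        public String onAccess(String subject) {
            order.put(subject, Boolean.TRUE);         // counts as an access
            if (order.size() > capacity) {
                String victim = order.keySet().iterator().next();
                order.remove(victim);
                return victim;
            }
            return null;
        }
    }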

This work will be carried out in collaboration with Thomas Gottron from the University of Koblenz-Landau.

Requirements

- Good programming skills
- Knowledge of Semantic Web techniques is an advantage
- Management of large data sets will be necessary

[1]  Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16: 52-58 (2012) http://www.sciencedirect.com/science/article/pii/S1570826812000716

Got interested? Send an email!

News

  • Homepage kicked off!