Knowledge Discovery Group

Vec4IR: a framework for practical information retrieval using word embeddings

Vec4IR is a Python framework designed to simulate a practical information retrieval setting using word embeddings. The framework provides native support for word embeddings with gensim (http://radimrehurek/gensim). Further key features are a built-in evaluation routine, an API design inspired by sklearn (scikit-learn.com), and extensibility: new retrieval models can be added to the Vec4IR framework. Several embedding-based retrieval models are included out of the box:

  • Word centroid similarity: The cosine similarity between the centroid of the document’s word vectors and the centroid of the query’s word vectors.
  • IDF re-weighted word centroid similarity: The cosine similarity between the document centroid and the query centroid after re-weighting the terms by inverse document frequency.
  • Word Mover’s distance: The matched documents are ranked by the minimum cumulative cost of moving the words of the query to the words of each document in the embedding space.
  • Doc2Vec inference: The matched documents are ranked according to the cosine similarity between inferred vectors of the query and the documents. 
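
The two centroid-based models above can be illustrated with a minimal sketch. It does not use Vec4IR or gensim: the three-dimensional word vectors, the helper names (centroid, cosine, idf, rank), and the IDF variant log(N/df) + 1 are assumptions chosen for illustration, and the IDF weights are applied to both query and document terms.

```python
import math
import numpy as np

# Toy 3-dimensional word vectors, made up for illustration. Real vectors
# would come from a trained embedding model.
EMBEDDING = {
    "neural":  np.array([0.9, 0.1, 0.0]),
    "network": np.array([0.8, 0.2, 0.1]),
    "deep":    np.array([0.7, 0.3, 0.0]),
    "cooking": np.array([0.0, 0.1, 0.9]),
    "recipe":  np.array([0.1, 0.0, 0.8]),
}


def centroid(tokens, weights=None):
    """Return the (optionally weighted) mean of the tokens' word vectors."""
    known = [t for t in tokens if t in EMBEDDING]
    if not known:
        raise ValueError("no token has a word vector")
    vectors = np.array([EMBEDDING[t] for t in known])
    w = np.array([1.0 if weights is None else weights.get(t, 1.0) for t in known])
    return (vectors * w[:, None]).sum(axis=0) / w.sum()


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def idf(docs):
    """Inverse document frequency per term, using the variant log(N / df) + 1."""
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}


def rank(query, docs, weights=None):
    """Return document indices sorted by descending centroid similarity."""
    q = centroid(query, weights)
    scores = [cosine(q, centroid(doc, weights)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [["deep", "neural", "network"], ["cooking", "recipe"], ["neural", "recipe"]]
query = ["neural", "network"]

wcs_ranking = rank(query, docs)             # plain word centroid similarity
idf_ranking = rank(query, docs, idf(docs))  # IDF re-weighted variant
print(wcs_ranking, idf_ranking)
```

On this toy data both variants place the first document on top and the cooking document last. The re-weighting only changes how much each term contributes to its centroid.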

To evaluate the retrieval models, the following metrics are considered: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Precision, Recall, and F1-score.
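
As a rough illustration of how some of these metrics are computed, the sketch below implements MRR, MAP, and precision, recall, and F1 at a cutoff k using their textbook definitions with binary relevance. It is not the framework's evaluation routine: the function names and the two example rankings are hypothetical, and NDCG is left out for brevity.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant document, or 0 if none is retrieved."""
    for pos, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each position holding a relevant document."""
    hits, total = 0, 0.0
    for pos, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0


def precision_recall_f1(ranked, relevant, k):
    """Precision, recall, and F1 over the top-k retrieved documents."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two hypothetical queries: (ranked result list, set of relevant document ids).
runs = [
    (["d3", "d1", "d2"], {"d1"}),
    (["d2", "d4", "d1"], {"d1", "d2"}),
]

# The "mean" metrics average the per-query values over all queries.
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
print(mrr, mean_ap, precision_recall_f1(*runs[1], k=2))
```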

To encourage research in this field, the focus lies on extensibility: adding a new retrieval model for evaluation should be as easy as possible. The target audience consists of researchers evaluating their own retrieval models and curious data scientists who want to find out which retrieval model fits their data best.
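
To give an idea of what a pluggable retrieval model in an sklearn-inspired style might look like, here is a minimal sketch. The class name, the fit and query method names, and their signatures are hypothetical and are not taken from the Vec4IR code base. The repository documents the actual interface.

```python
class TermOverlapRetrieval:
    """Hypothetical toy model: ranks documents by the number of shared query terms."""

    def fit(self, docs):
        # docs maps a document id to its raw text; index each text as a term set.
        self._index = {
            doc_id: set(text.lower().split()) for doc_id, text in docs.items()
        }
        return self  # returning self allows call chaining, as in scikit-learn

    def query(self, text, k=None):
        # Score every indexed document by its term overlap with the query.
        terms = set(text.lower().split())
        scores = {doc_id: len(terms & toks) for doc_id, toks in self._index.items()}
        # Keep only matching documents; break score ties by document id.
        matched = [doc_id for doc_id, score in scores.items() if score > 0]
        ranked = sorted(matched, key=lambda doc_id: (-scores[doc_id], doc_id))
        return ranked[:k]  # k=None returns all matches


model = TermOverlapRetrieval().fit({
    "a": "word embeddings for retrieval",
    "b": "cooking recipes",
    "c": "retrieval models",
})
results = model.query("embeddings retrieval")
print(results)
```

Separating indexing (fit) from ranking (query) means an evaluation loop only needs these two calls, so a different model can be swapped in without changing that loop.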

For the source code and more information, please refer to the repository on GitHub: https://github.com/lgalke/vec4ir

Publications

L. Galke, A. Saleh, and A. Scherp: Word Embeddings for Practical Information Retrieval, INFORMATIK 2017 – WS34 Deep Learning in heterogenen Datenbeständen, Gesellschaft für Informatik e.V., 2017.

News

  • Homepage kicked off!