EVE: Explainable Word Embeddings Using Wikipedia

This page contains supplementary material for the paper:
M. Atif Qureshi, Derek Greene. "EVE: Explainable Vector Based Embedding Technique Using Wikipedia", 2017 (Under review).

Summary

This research develops an unsupervised, explainable word embedding technique called EVE, which is built upon the structure of Wikipedia. The proposed model defines the dimensions of a semantic vector representing a word using human-readable labels, thereby making the vector readily interpretable. Specifically, each vector is constructed using the Wikipedia category graph structure together with the Wikipedia article link structure. To test the effectiveness of the proposed word embedding model, we consider its usefulness in three fundamental tasks: 1) intruder detection, which evaluates its ability to identify a non-coherent vector in a list of otherwise coherent vectors; 2) clustering, which evaluates its tendency to group related vectors together while keeping unrelated vectors in separate clusters; and 3) sorting relevant items first, which evaluates its ability to rank the vectors (items) relevant to a query at the top of the result list. For each task, we also propose a strategy to generate a task-specific, human-interpretable explanation from the model.
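To make the idea of labeled dimensions concrete, the following is a minimal Python sketch, not the EVE implementation. It assumes each word is stored as a sparse mapping from a human-readable label to a weight. The toy vectors, labels, and weights, and the helper names `cosine`, `find_intruder`, `explain`, and `rank`, are all invented for illustration. The intruder is taken to be the item with the lowest mean cosine similarity to the others, and the explanation is the set of labels with the largest summed weight. Both are plain heuristics standing in for the strategies described in the paper.

```python
import math

# Toy vectors: every label and weight here is invented for illustration
# and is NOT taken from the EVE model or the released dataset.
VECTORS = {
    "apple":  {"Category:Fruits": 1.0, "Category:Plants": 0.5},
    "banana": {"Category:Fruits": 1.0, "Category:Plants": 0.4},
    "cherry": {"Category:Fruits": 0.9, "Category:Plants": 0.6},
    "piano":  {"Category:Musical instruments": 1.0},
}


def cosine(u, v):
    """Cosine similarity between two sparse label->weight vectors."""
    dot = sum(w * v.get(label, 0.0) for label, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_intruder(items, vectors):
    """Return the item with the lowest mean similarity to the other items."""
    def mean_similarity(item):
        others = [o for o in items if o != item]
        return sum(cosine(vectors[item], vectors[o]) for o in others) / len(others)
    return min(items, key=mean_similarity)


def explain(items, vectors, top_k=2):
    """Return the top_k labels by summed weight across the given items."""
    totals = {}
    for item in items:
        for label, weight in vectors[item].items():
            totals[label] = totals.get(label, 0.0) + weight
    return sorted(totals, key=totals.get, reverse=True)[:top_k]


def rank(query, candidates, vectors):
    """Sort candidates by decreasing similarity to the query item."""
    return sorted(candidates,
                  key=lambda c: cosine(vectors[query], vectors[c]),
                  reverse=True)


items = ["apple", "banana", "cherry", "piano"]
intruder = find_intruder(items, VECTORS)
coherent = [i for i in items if i != intruder]
print("intruder:", intruder)
print("shared labels:", explain(coherent, VECTORS))
print("ranking for 'apple':", rank("apple", ["piano", "banana", "cherry"], VECTORS))
```

Because every dimension carries a readable label, the explanation step only needs to report the labels themselves, which is the property that a dense embedding such as Word2Vec lacks.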

Data

To evaluate the performance of different word embedding models, we constructed a new dataset from a 2015 dump of Wikipedia. The dataset is composed of seven different topical types, each containing at least five sub-topical categories. On average, each sub-topical category contains a list of 20 items or concepts. The usefulness of the dataset lies in the fact that the organization from topics to categories to items is made on the basis of factual position.

Download

>> Pre-trained word embedding models (EVE, Word2Vec, FastText, GloVe) for the above Wikipedia dataset are available here.

>> Download visualizations — Visualizations of the seven topic types in the dataset, produced using t-SNE.

Software

>> A Python reference implementation to query and analyze the above dataset is available here.

Contact

For further information please contact Dr. M. Atif Qureshi.