TextLuas

This page contains supplementary materials for the paper TextLuas: Tracking and Visualizing Document and Term Clusters in Dynamic Text Data. D. Greene, D. Archambault, P. Cunningham (2010).

Description

TextLuas is a system for computing and visualizing dynamic clusters identified in large dynamic text datasets. The novel aspects of this system include a model for tracking cluster life cycle events, a co-clustering algorithm to compute individual time step graphs, and techniques for visualizing the evolution of dynamic clusters, both from the perspective of life cycle events and cluster contents.

Further details on the dynamic co-clustering algorithm used to cluster individual time step graphs are available in a technical report:

Spectral Co-Clustering for Dynamic Bipartite Graphs. D. Greene, P. Cunningham (2010). Tech. Rep. UCD-CSI-2010-05, School of Computer Science & Informatics, UCD.

Datasets

We provide two new dynamic text datasets for dynamic clustering tasks.

The first dataset consists of a set of bookmarks from the Del.icio.us web portal, originally collected by Görlitz et al. The subset here covers the 2,000 top tags and 5,000 top sites across an eleven-month period from January to November 2006. This data is divided into 44 weekly time step graphs.

>> Download Delicious Top 5000 Bookmark Collection (14MB)

The second dataset is a collection of articles relating to economic news from three online news sources (RTE, The Irish Times, The Irish Independent), which were previously collected for the purpose of sentiment classification. The data was pre-processed so that news articles are represented solely by named entities drawn from a manually curated list. The data was then divided into 36 time steps, each one week in duration. Each time step graph contains approximately 604 articles described by 597 terms (entities).

>> Download Irish Economic News Entity Dataset (August 2009 to April 2010) (1MB)

The two datasets are provided in sparse matrix form. The data has already been pre-processed and divided into time steps. Each time step graph is stored via three files:

*.mtx: Term frequencies stored in a sparse term-document matrix in Matrix Market format.
*.fids: List of terms in the corpus, with each line corresponding to a row of the corresponding term-document matrix.
*.ids: List of document identifiers, with each line corresponding to a column of the corresponding term-document matrix.
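
As an illustration of how the three files for one time step fit together, the following Python sketch loads them using only the standard library. It is not part of the TextLuas distribution. The function names and the file prefix "step01" are hypothetical, and the parser assumes the *.mtx file uses the general coordinate layout of Matrix Market with a numeric value on every entry line.

```python
def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_matrix_market(path):
    """Parse a Matrix Market coordinate file.

    Returns (n_rows, n_cols, entries), where entries maps zero-based
    (row, col) pairs to float values. Assumes the 'general' coordinate
    layout with one 'row col value' triple per entry line.
    """
    with open(path, encoding="utf-8") as handle:
        header = handle.readline()
        if not header.startswith("%%MatrixMarket") or "coordinate" not in header:
            raise ValueError("expected a Matrix Market coordinate file")
        # Skip comment lines (starting with '%') and blank lines.
        data = [ln.split() for ln in handle
                if ln.strip() and not ln.startswith("%")]
    # The first remaining line holds the dimensions and the entry count.
    n_rows, n_cols, n_entries = (int(x) for x in data[0])
    if len(data) - 1 != n_entries:
        raise ValueError("entry count does not match the size line")
    entries = {}
    for row, col, value in data[1:]:
        # Matrix Market indices are one-based; convert to zero-based.
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    return n_rows, n_cols, entries


def load_time_step(prefix):
    """Load the .mtx, .fids and .ids files that share a common prefix."""
    n_rows, n_cols, entries = read_matrix_market(prefix + ".mtx")
    terms = read_lines(prefix + ".fids")    # one term per matrix row
    doc_ids = read_lines(prefix + ".ids")   # one document per matrix column
    if len(terms) != n_rows or len(doc_ids) != n_cols:
        raise ValueError("term or document list does not match matrix shape")
    return terms, doc_ids, entries
```

Calling load_time_step("step01") would then return the term list, the document identifier list, and a dictionary in which entries[(i, j)] is the frequency of terms[i] in doc_ids[j]. For larger experiments, a sparse matrix library is likely a better fit than a plain dictionary.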

Software

The TextLuas clustering tool is written in Java 1.6. A cross-platform binary distribution, containing instructions and sample files, is available here. It can be applied to the corpora provided above:

>> Download TextLuas Clustering Tool - Cross-Platform Java Binary (3MB)

The TextLuas visualization tool is written in C++, using the Qt framework and the Tulip graph toolkit. Currently, stand-alone binaries are available for 64-bit Fedora-like Linux distributions. Other versions will appear here shortly:

>> Download TextLuas Visualization Tool - Linux 64-bit Binary (6MB)

The results described in the paper, produced on the two corpora above, are also available to download. The *.timeline files can be used in conjunction with the TextLuas visualization tool above:

>> Download Result Input Files for TextLuas Visualization Tool

Contact

For further information please contact Derek Greene.