Multi-View Twitter Datasets
This page contains supplementary material for the paper:
D. Greene and P. Cunningham (2013). "Producing a Unified Graph Representation from Multiple Social Network Views". ACM Web Science 2013. [Short PDF] [Extended PDF]
Summary
In many social networks, several different link relations will exist between the same set of users. Additionally, attribute or textual information will be associated with those users, such as user-generated content or demographic details. For many data analysis tasks, such as community finding and data visualisation, the provision of multiple heterogeneous types of user data makes the analysis process more complex. We have developed an unsupervised method for integrating multiple data views to produce a single unified graph representation, based on the combination of the k-nearest neighbour sets for users derived from each view. These views can be either relation-based or feature-based. The proposed method is evaluated on a number of annotated multi-view Twitter datasets listed below, where it is shown to support the discovery of the underlying community structure in the data.
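As a rough illustration of the idea of combining per-view k-nearest neighbour sets, the sketch below builds one undirected graph by taking a plain union of the neighbour sets found in each view. This is a simplified stand-in rather than the aggregation procedure from the paper, which may differ. The function names, the dense list-of-lists similarity input, and the union rule are all assumptions made for this example.

```python
def knn_sets(similarity, k):
    """For each user i, return the set of its k most similar other users.

    `similarity` is a square list of lists (assumed input format for this
    sketch), where similarity[i][j] is the similarity of users i and j.
    """
    n = len(similarity)
    result = []
    for i in range(n):
        # Exclude the user itself, then rank the rest by decreasing similarity.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: similarity[i][j], reverse=True)
        result.append(set(others[:k]))
    return result


def unified_graph(views, k):
    """Union the k-NN sets from every view into one undirected edge set.

    NOTE: a plain union is an assumption of this sketch, not the
    aggregation step described in the paper.
    """
    edges = set()
    for view in views:
        for i, neighbours in enumerate(knn_sets(view, k)):
            for j in neighbours:
                # Store each undirected edge once, as (smaller, larger).
                edges.add((min(i, j), max(i, j)))
    return edges
```

Calling `unified_graph([view_a, view_b], k=5)` on two similarity matrices over the same users returns a set of index pairs, which can then be passed to any community finding or visualisation tool that accepts an edge list.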
Data
For evaluation purposes, we collected five topical Twitter datasets based on curated user lists. A manually curated ground truth set of communities is available for each dataset. The datasets are as follows:
- football: A collection of 248 English Premier League football players and clubs active on Twitter. The disjoint ground truth communities correspond to the 20 individual clubs in the league.
- olympics: A dataset of 464 users, covering athletes and organisations that were involved in the London 2012 Summer Olympics. The disjoint ground truth communities correspond to 28 different sports.
- politics-uk: 419 Members of Parliament (MPs) from the United Kingdom. The ground truth consists of five groups, corresponding to political parties.
- politics-ie: A collection of Irish politicians and political organisations, assigned to seven disjoint ground truth groups, according to their affiliation.
- rugby: A collection of 854 international Rugby Union players, clubs, and organisations active on Twitter. The ground truth consists of communities corresponding to 15 countries. The communities are overlapping, as players can be assigned to both their home nation and the nation in which they play club rugby.
Download
We make the five datasets available for non-commercial and research purposes only. They are provided solely in pre-processed matrix format: to comply with the Twitter Terms of Service (TOS), we do not include any raw tweets or other full-text content. Users and user lists are referenced by their unique Twitter IDs rather than by full names or screen names.
The datasets are provided in a single archive. Each dataset is contained within its own sub-directory, and nine different "views" of each dataset are provided in sparse matrix representation. For a dataset <name>, the view files have the following prefixes:
- <name>-follows, <name>-followedby, <name>-mentions, <name>-mentionedby, <name>-retweets, <name>-retweetedby, <name>-listmerged500, <name>-lists500, <name>-tweets500
- The file <name>.ids contains the list of all user IDs in the dataset.
- The file <name>.communities contains the ground truth community information, with one community per line.
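The file layout above could be handled with a few small helpers, sketched below. The nine view suffixes come from the list above. The rest is assumed for illustration: that <name>.ids holds one ID per line, and that each line of <name>.communities has the form `label: id1,id2,...`. Neither detail is stated on this page, so check the files in the archive before relying on this parser. All function names are hypothetical.

```python
# The nine view suffixes listed on this page.
VIEW_SUFFIXES = [
    "follows", "followedby", "mentions", "mentionedby",
    "retweets", "retweetedby", "listmerged500", "lists500", "tweets500",
]


def view_prefixes(name):
    """Return the nine view file prefixes for a dataset, e.g. 'rugby-follows'."""
    return [f"{name}-{suffix}" for suffix in VIEW_SUFFIXES]


def parse_ids(lines):
    """Parse user IDs, ASSUMING one ID per line. IDs are kept as strings."""
    return [line.strip() for line in lines if line.strip()]


def parse_communities(lines):
    """Parse ground truth communities, one community per line.

    ASSUMED line format (not confirmed by this page): 'label: id1,id2,...'
    Returns a dict mapping each community label to its list of member IDs.
    """
    communities = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, _, members = line.partition(":")
        communities[label.strip()] = [
            member.strip() for member in members.split(",") if member.strip()
        ]
    return communities
```

Both parsers accept any iterable of lines, so an open file handle can be passed directly, for example `parse_ids(open(path))` with `path` pointing at the `.ids` file inside a dataset's sub-directory. Overlapping communities, as in the rugby dataset, are supported because the same ID may appear under more than one label.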