Analysis of the Irish Blogosphere

This page contains supplementary materials for the ICWSM 2011 paper "Identifying Representative Textual Sources in Blog Networks" by K. Wade, D. Greene, C. Lee, D. Archambault, and P. Cunningham (2011). [PDF] [BibTeX]

Summary: In this study we apply methods from social network analysis and visualization to facilitate a study of the Irish blogosphere from a cultural studies perspective. We focus on solving the practical issues that arise when the goal is to perform textual analysis of the corpus produced by a network of bloggers. Previous studies into blogging networks have noted difficulties arising when trying to identify the extent and boundaries of these networks. As a response to calls for increasingly data-led approaches in media and cultural studies, we discuss a variety of social network analysis methods that can be used to identify which blogs can be seen as members of a posited "Irish blogging network". We identify hub blogs, communities of sites corresponding to different topics, and representative bloggers within these communities. Based on this study, we propose a set of analysis guidelines for researchers who wish to map out blogging networks.

Irish Blogosphere

Data

To produce an initial seed set, we began with 21 popular blogs within the Irish blogosphere, as indicated by the winners of the “2010 Irish Blog Awards”. Starting with this set, we manually extracted blogroll links wherever a blogroll was available, and repeated this process for two steps out from the seed set. We then manually filtered out blogs that were password-protected, were inactive during 2010, were aggregation sites, or had a geographical designation that did not correspond to Ireland. Finally, we removed blogs with fewer than two in-links (i.e. links from blogrolls on other blogs, taken as an indicator of popularity). For the remaining blogs, we attempted to retrieve archived post content from the Google Blog Search engine. We removed blogs for which posts were no longer available, leaving a core set of 614 blogs. We then extracted post-link connections, corresponding to hyperlinks between blogs, from the raw blog post HTML sources.
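The two-step expansion and the in-link filter described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study, where links were extracted and blogs filtered manually. The function names, the toy blogroll dictionary, and the blog names are hypothetical. The sketch also assumes that the in-link filter is applied in a single pass and that seed blogs are not exempt from it, neither of which is stated in the text.

```python
from collections import Counter


def snowball(seeds, blogroll, steps=2):
    """Collect blogs reachable from the seeds within `steps` blogroll hops."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(steps):
        # Follow blogroll links from the current frontier, keeping only new blogs.
        frontier = {t for b in frontier for t in blogroll.get(b, ())} - seen
        seen |= frontier
    return seen


def filter_by_in_links(blogs, blogroll, min_in_links=2):
    """Keep blogs linked from at least `min_in_links` other collected blogs.

    Assumption: one pass only; counts are not recomputed after removals.
    """
    counts = Counter(
        target
        for source in blogs
        for target in set(blogroll.get(source, ()))
        if target in blogs and target != source
    )
    return {b for b in blogs if counts[b] >= min_in_links}


# Toy blogroll data with hypothetical blog names: blog -> blogs on its blogroll.
blogroll = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a", "d"],
    "d": ["e"],
    "e": ["f"],
}

collected = snowball({"a"}, blogroll, steps=2)
core = filter_by_in_links(collected, blogroll)
```

In this toy example the seed blog itself is dropped because only one collected blog links to it. A real pipeline might instead choose to always retain seed blogs.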

Blogroll and post-link network data sets are provided in a number of different common graph formats:

>> Download Irish Blogosphere network data  

In total we retrieved 179k unique blog posts for the core set of 614 blogs. While the vast majority of entries (93%) were published during the period 2007–2011, entries in the collection date back as far as 1997. After extracting text content from the raw HTML, a total of 176k non-empty posts remained. Since we were interested in grouping blogs and bloggers, rather than individual posts, we chose to represent each blog by its content profile, defined as the concatenation of the text content from all available posts for that blog.
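As an illustration of the content profile idea, the following Python sketch concatenates the non-empty post texts of each blog into a single string. The function name, the `(blog_id, text)` input layout, and the sample posts are assumptions made for the example. They do not reflect the format of the downloadable data set or the pre-processing actually applied.

```python
from collections import defaultdict


def build_content_profiles(posts):
    """Build one concatenated text profile per blog from (blog_id, text) pairs.

    Posts whose text is empty after stripping whitespace are skipped.
    """
    grouped = defaultdict(list)
    for blog_id, text in posts:
        text = text.strip()
        if text:
            grouped[blog_id].append(text)
    # Join each blog's posts, in input order, into a single profile string.
    return {blog_id: " ".join(texts) for blog_id, texts in grouped.items()}


# Hypothetical sample posts.
posts = [
    ("blog1", "First post."),
    ("blog2", "Hello from Cork."),
    ("blog1", "   "),
    ("blog1", "Second post."),
]

profiles = build_content_profiles(posts)
```

Each profile can then be treated as a single document when building a term-based representation of a blog.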

The blog content text data set is provided in sparse, pre-processed format:

>> Download Irish Blogosphere text data (13MB)  

The Delicious tagging data used for validation is also available:

>> Download Irish Blogosphere tagging data   

Acknowledgments

This work is supported by Science Foundation Ireland Grant No. 08/SRC/I140 (Clique: Graph and Network Analysis Cluster). Karen Wade acknowledges the support of the IRCHSS GREP scholarship programme for Gender, Identity and Cultural Change.