Analysis of the Irish Blogosphere
This page contains supplementary material for the paper:
K. Wade, D. Greene, C. Lee, D. Archambault, P. Cunningham (2011). "Identifying Representative Textual Sources in Blog Networks". Proc. AAAI ICWSM 2011. [PDF] [BibTeX]
In this study we apply methods from social network analysis and visualization to facilitate a study of the Irish blogosphere from a cultural studies perspective. We focus on solving the practical issues that arise when the goal is to perform textual analysis of the corpus produced by a network of bloggers. Previous studies into blogging networks have noted difficulties arising when trying to identify the extent and boundaries of these networks. As a response to calls for increasingly data-led approaches in media and cultural studies, we discuss a variety of social network analysis methods that can be used to identify which blogs can be seen as members of a posited "Irish blogging network". We identify hub blogs, communities of sites corresponding to different topics, and representative bloggers within these communities. Based on this study, we propose a set of analysis guidelines for researchers who wish to map out blogging networks.
To produce an initial seed set, we began with 21 popular blogs within the Irish blogosphere, as indicated by the winners of the “2010 Irish Blog Awards”. Starting with this set, we manually extracted blogroll links wherever a blogroll was available, and repeated this process for two steps out from the seed set. We then manually filtered out blogs that were password-protected, were inactive during 2010, were aggregation sites, or had a geographical designation that did not correspond to Ireland. Finally, we removed blogs with fewer than two in-links (i.e. links coming from blogrolls on other blogs, representing popularity). For the remaining blogs we attempted to retrieve archived blog post content from the Google Blog Search engine, and removed blogs for which posts were no longer available, leaving a core set of 614 blogs. We then extracted post-link connections, corresponding to hyperlinks between blogs, from the raw blog post HTML sources.
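The expansion and in-link filtering steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study, where blogroll links were extracted and filtered manually. The function names, the `blogrolls` dictionary layout, and the toy data are all assumptions made for the example. The sketch also assumes that in-links are counted only from blogs inside the collected set and that the filter is applied in a single pass.

```python
from collections import Counter


def expand_seed_set(seeds, blogrolls, steps=2):
    """Collect every blog reachable within `steps` blogroll hops of the seeds.

    `blogrolls` is a hypothetical dict mapping a blog to the list of blogs
    named in its blogroll; blogs without a blogroll are simply absent.
    """
    found = set(seeds)
    frontier = set(seeds)
    for _ in range(steps):
        next_frontier = set()
        for blog in frontier:
            for target in blogrolls.get(blog, []):
                if target not in found:
                    found.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return found


def filter_by_in_links(blogs, blogrolls, min_in_links=2):
    """Keep blogs that receive at least `min_in_links` blogroll links.

    Assumption: only links from other blogs in `blogs` are counted, and
    each linking blog counts once per target.
    """
    in_links = Counter()
    for source in blogs:
        for target in set(blogrolls.get(source, [])):
            if target in blogs and target != source:
                in_links[target] += 1
    return {blog for blog in blogs if in_links[blog] >= min_in_links}


# Toy blogroll data (hypothetical blog names).
toy_blogrolls = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}

candidates = expand_seed_set(["a"], toy_blogrolls, steps=2)
core = filter_by_in_links(candidates, toy_blogrolls, min_in_links=2)
```

In the toy data, blog "e" lies three hops from the seed and is therefore never collected, while "a" and "b" are dropped by the in-link filter because fewer than two collected blogs link to them.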
Blogroll and post-link network data sets are provided in a number of different common graph formats:
In total we retrieved 179k unique blog posts for the core set of 614 blogs. While the vast majority of entries (93%) were published during the period 2007–2011, entries in the collection date back as far as 1997. After extracting text content from the raw HTML, a total of 176k non-empty posts remained. Since we were interested in grouping blogs and bloggers, rather than individual posts, we chose to represent each blog by its content profile, defined as the concatenation of the text content from all available posts for that blog.
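As a rough illustration of the content profile idea, the sketch below strips markup from each post, discards posts that end up empty, and concatenates the remaining text per blog. It is not the pre-processing pipeline used in the study: the function names, the `(blog_id, html)` input layout, the newline separator, and the choice to skip `script` and `style` content are all assumptions made for the example. It relies only on Python's standard `html.parser` module.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_content_profiles(posts):
    """Concatenate the text of all non-empty posts for each blog.

    `posts` is a hypothetical iterable of (blog_id, raw_html) pairs.
    """
    texts = defaultdict(list)
    for blog_id, html in posts:
        text = html_to_text(html)
        if text:  # drop posts with no text content
            texts[blog_id].append(text)
    return {blog_id: "\n".join(parts) for blog_id, parts in texts.items()}


# Toy posts (hypothetical blog identifiers and content).
toy_posts = [
    ("blog1", "<p>Hello <b>world</b></p>"),
    ("blog1", "<div></div>"),
    ("blog1", "<p>Second post</p>"),
    ("blog2", "<script>var x;</script><p>Only text</p>"),
]

profiles = build_content_profiles(toy_posts)
```

Here the empty `<div>` post contributes nothing, mirroring the step in which posts without text content are removed before the profiles are built.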
The blog content text data set is provided in sparse, pre-processed format:
The Delicious tagging data used for validation is also available:
This work is supported by Science Foundation Ireland Grant No. 08/SRC/I140 (Clique: Graph and Network Analysis Cluster). Karen Wade acknowledges the support of the IRCHSS GREP scholarship programme for Gender, Identity and Cultural Change.