Analysis of the Irish Blogosphere
Data

To produce an initial seed set of blogs, we began with 21 popular blogs within the Irish blogosphere, as indicated by the winners of the “2010 Irish Blog Awards”. Starting with this set, we manually extracted blogroll links wherever a blogroll was available, and repeated this process for two steps outward from the seed set. We then manually filtered out blogs that were password-protected, were inactive during 2010, were aggregation sites, or had a geographical designation that did not correspond to Ireland. Finally, we removed blogs with fewer than two in-links (i.e. links coming from blogrolls on other blogs, representing popularity).

For the remaining blogs we attempted to retrieve archived blog post content from the Google Blog Search engine. We removed blogs for which posts were no longer available, leaving a core set of 614 blogs. We then extracted post-link connections, corresponding to hyperlinks between blogs, from the raw HTML sources of the blog posts. The blogroll and post-link network data sets are provided in a number of common graph formats:

>> Download Irish Blogosphere network data

In total we retrieved 179k unique blog posts for the core set of 614 blogs. While the vast majority of entries (93%) were published during the period 2007–2011, entries in the collection date back as far as 1997. After extracting text content from the raw HTML, a total of 176k non-empty posts remained. Since we were interested in grouping blogs and bloggers, rather than individual posts, we chose to represent each blog by its content profile, defined as the concatenation of the text content from all available posts for that blog. The blog content text data set is provided in a sparse, pre-processed format:

>> Download Irish Blogosphere text data (13MB)

The Delicious tagging data used for validation is also available:

>> Download Irish Blogosphere tagging data

Acknowledgments

This work is supported by Science Foundation Ireland Grant No. 08/SRC/I140 (Clique: Graph and Network Analysis Cluster). Karen Wade acknowledges the support of the IRCHSS GREP scholarship programme for Gender, Identity and Cultural Change.
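
As an illustration of two of the data preparation steps described under Data (removing blogs with fewer than two blogroll in-links, and building one content profile per blog), the following is a minimal Python sketch. It is not the code used to build the data sets. The function names, the `(source, target)` edge tuples, the `(blog, text)` post tuples, the single-pass filter, and the decision to ignore self-links and join posts with a space are all assumptions made for the example.

```python
from collections import defaultdict


def filter_by_inlinks(blogroll_edges, min_inlinks=2):
    """Return the blogs linked from at least `min_inlinks` distinct other blogs.

    `blogroll_edges` is an iterable of (source, target) pairs, one per
    blogroll link. This is a single pass over the edges (an assumption:
    the counts are not recomputed after blogs are dropped).
    """
    inlinks = defaultdict(set)
    for source, target in blogroll_edges:
        if source != target:  # assumption: a blog linking to itself does not count
            inlinks[target].add(source)
    return {blog for blog, sources in inlinks.items() if len(sources) >= min_inlinks}


def build_content_profiles(posts):
    """Concatenate the text of all non-empty posts for each blog.

    `posts` is an iterable of (blog, text) pairs. Posts whose text is
    empty after stripping whitespace are skipped.
    """
    texts = defaultdict(list)
    for blog, text in posts:
        cleaned = text.strip()
        if cleaned:
            texts[blog].append(cleaned)
    # assumption: posts are joined with a single space
    return {blog: " ".join(parts) for blog, parts in texts.items()}
```

For example, given blogroll links a→c, b→c and a→b, only blog c would be kept, since b has a single in-link and a has none.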
