Network Analysis of Recurring YouTube Spam Campaigns
This page contains supplementary material for the papers:
- "Network Analysis of Recurring YouTube Spam Campaigns". D. O'Callaghan, M. Harrigan, J. Carthy, P. Cunningham (2012) [PDF]
- "Identifying Discriminating Network Motifs in YouTube Spam". D. O'Callaghan, M. Harrigan, J. Carthy, P. Cunningham (2012) [PDF]
Like other social media websites, YouTube is not immune to the attention of spammers. In particular, evidence can be found of attempts to attract users to malicious third-party websites. As this type of spam is often associated with orchestrated campaigns, it has a discernible network signature, which can be observed in networks derived from the comments that users post to videos. In these papers, we examine examples of different YouTube spam campaigns of this nature, and use a feature selection process to identify network motifs that are characteristic of the corresponding campaign strategies. We demonstrate how these discriminating motifs can be used as part of a network motif profiling process that tracks the activity of spam user accounts over time, enabling the process to scale to larger networks.
Following the lead of earlier related YouTube research, a data set was collected in order to permit the investigation of contemporary spam comment activity. Whereas other researchers have performed an extensive crawl of the YouTube network, we opted for a specific selection of the available data, given that spam comments in YouTube tend to be directed towards a subset of the entire video set: more popular videos offer a larger audience, and so generally have a higher probability of attracting attention from spammers.
Another issue to be considered is the accessibility of certain YouTube data attributes. The recent activity of a user profile contains a number of potential attributes for use in the derivation of representative networks, such as comments posted to videos and subscriptions added to other users. Similarly, the list of subscribers for a particular user would also be useful. However, access to these attributes is often restricted, meaning that reliance on such data may lead to inaccuracies in subsequent experiments. In contrast, the comments found on a public video's page (and the users who posted them) are always accessible. Given these issues, we decided to use only data to which access was not restricted, namely the comments posted to videos along with the associated user accounts.
The data was retrieved using the YouTube Data API, which provides access to video and user profile information. The API imposes some limits, which are noted in the steps below. Apart from video and user information, access is also provided to standard feeds such as Most Viewed videos, Top Rated videos etc. The fact that these feeds are periodically updated (usually daily) facilitated our objective of analysing recurring spam campaigns, as it enabled the retrieval of popular videos (i.e. those attracting spam comments) on a continual basis. Therefore, the retrieval process was executed periodically as follows:
- Retrieved the current video list from the most viewed standard feed for the US region (the API limited this to a maximum of 100 videos).
- For each video in the list:
  - If this video had not appeared in an earlier feed list, retrieved its meta-data such as upload time, description etc.
  - Retrieved the comments and associated meta-data for the last twenty-four hours, or those posted since the last retrieval time (if more recent). The API limited the returned comments to a maximum of 1,000.
- In order to track the comment activity on particular videos appearing intermittently in the most viewed feed, comments were also retrieved for those videos not in the current feed list that appeared in an earlier list from the previous forty-eight hours.
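The steps above can be sketched as a single periodic pass. This is a minimal illustration rather than the code used for the papers: the three `fetch_*` callables are hypothetical stand-ins for the corresponding API requests (no real client library calls are shown), and the `state` dictionary layout is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone

MAX_FEED_VIDEOS = 100                 # feed size limit stated above
MAX_COMMENTS = 1000                   # per-video comment limit stated above
LOOKBACK = timedelta(hours=24)        # default comment window
REVISIT_WINDOW = timedelta(hours=48)  # keep tracking videos that left the feed


def new_state():
    """Bookkeeping kept between passes (layout is an assumption)."""
    return {"metadata": {}, "last_in_feed": {}, "last_retrieval": {}}


def run_retrieval(now, state, fetch_feed, fetch_metadata, fetch_comments):
    """Run one retrieval pass and return {video_id: [comments]}.

    fetch_feed(), fetch_metadata(video_id) and fetch_comments(video_id, since)
    are hypothetical callables supplied by the caller.
    """
    feed = list(fetch_feed())[:MAX_FEED_VIDEOS]
    in_feed = set(feed)

    # Videos absent from the current feed but seen in a feed list
    # during the previous forty-eight hours.
    revisit = [
        video_id
        for video_id, seen in state["last_in_feed"].items()
        if video_id not in in_feed and now - seen <= REVISIT_WINDOW
    ]

    for video_id in feed:
        # Meta-data is only fetched the first time a video appears.
        if video_id not in state["metadata"]:
            state["metadata"][video_id] = fetch_metadata(video_id)
        state["last_in_feed"][video_id] = now

    collected = {}
    for video_id in feed + revisit:
        # Last 24 hours, or since the last retrieval if that is more recent.
        default_since = now - LOOKBACK
        since = max(default_since,
                    state["last_retrieval"].get(video_id, default_since))
        comments = list(fetch_comments(video_id, since))[:MAX_COMMENTS]
        collected[video_id] = comments
        state["last_retrieval"][video_id] = now
    return collected
```

Passing the fetch functions in as arguments keeps the scheduling logic separate from any particular API client, so the pass can be exercised with stub functions.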
Data retrieval ran from October 31st, 2011 to January 17th, 2012, and yielded the following:
- Videos: 6,407
- Total comments: 6,431,471
- Comments marked as spam: 481,334
- Total users: 2,860,264
- Spam comment users: 177,542
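For orientation, the proportions implied by these totals can be worked out directly. The short calculation below uses only the figures listed above; the variable names are illustrative, and reading "spam comment users" as users with at least one comment marked as spam is an assumption.

```python
# Totals copied from the list above.
videos = 6_407
total_comments = 6_431_471
spam_comments = 481_334
total_users = 2_860_264
spam_users = 177_542

# Share of comments marked as spam (roughly 7.5%).
spam_comment_share = spam_comments / total_comments

# Share of users associated with spam comments (roughly 6.2%).
spam_user_share = spam_users / total_users

# Mean number of retrieved comments per video (roughly 1,004).
comments_per_video = total_comments / videos

print(f"spam comments: {spam_comment_share:.1%}")
print(f"spam users:    {spam_user_share:.1%}")
print(f"comments per video: {comments_per_video:.0f}")
```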
The data is available in CSV format, where the fields are:
- published - milliseconds since Unix epoch
- videoId - anonymized
- userId - anonymized
- commentText - the comment text. Line break characters have been replaced with '\n' and '\r' as required, and user ids have been anonymized.
- spam - True|False, indicating whether the comment was marked as spam.
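The fields above could be read with Python's standard `csv` module along the following lines. This is a sketch under stated assumptions: the column order is taken to match the list above, the presence of a header row is left as a parameter because it is not specified here, and the `Comment` type and `read_comments` function are names introduced for this example.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import Iterator, NamedTuple, TextIO

# Assumed column order, following the field list above.
FIELDS = ["published", "videoId", "userId", "commentText", "spam"]
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class Comment(NamedTuple):
    published: datetime  # converted from milliseconds since the Unix epoch
    video_id: str        # anonymized
    user_id: str         # anonymized
    text: str            # left as stored; line-break escapes are not expanded
    spam: bool


def read_comments(handle: TextIO, has_header: bool = False) -> Iterator[Comment]:
    """Yield one Comment per well-formed CSV row."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip rows that do not match the assumed layout
        published_ms, video_id, user_id, text, spam = row
        yield Comment(
            published=EPOCH + timedelta(milliseconds=int(published_ms)),
            video_id=video_id,
            user_id=user_id,
            text=text,
            spam=spam.strip().lower() == "true",
        )


# Usage with two made-up rows (not taken from the data set).
sample = io.StringIO(
    '1320019200000,v1,u1,"nice video, thanks",False\n'
    "1320019260000,v1,u2,click here,True\n"
)
for comment in read_comments(sample):
    print(comment.published.isoformat(), comment.user_id, comment.spam)
```

Timestamps are converted with integer millisecond arithmetic rather than floating-point division, which avoids rounding in the sub-second part.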
The data is available to download for research purposes:
This work is supported by 2Centre, the EU-funded Cybercrime Centres of Excellence Network, and by Science Foundation Ireland under grant 08/SRC/I140: Clique: Graph and Network Analysis Cluster.