I’ve created this blog entry mainly as a way of providing access to some files related to the work Maria Garrido and I have been doing on the Twitter conversation surrounding the G-20 meeting in Pittsburgh in September of 2009.
Briefly, our aim was to examine tweets that included the #g20 hashtag and figure out how a leadership structure may have emerged. However, the data may be useful to others as well.
If you make use of the materials you’ve found here, please cite this web page, including both Alex Halavais and Maria Garrido as authors.
Tweets and RT net
The core data is simply a collection of tweets gathered with The Archivist, based on a search for the hashtag #g20 from midnight, September 20 to midnight, September 29, 2009. The Archivist should be able to store this in both XML and CSV formats, but for some reason the CSV export did not work every time. The files were also collected in various overlapping chunks to provide redundancy, and so needed to be merged without duplicates. This zip file includes that data:
g20.zip [1.8 Mb]
Note that the tweets in g20tweets.csv are not sorted in any way. The file also lacks a header line. The items are:
User, date/time, tweet ID, user image, status
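For anyone who wants to load the file or redo the merge, here is a minimal sketch of reading headerless chunks and keeping one row per tweet ID. It is not the script in the zip file. The field names, the UTF-8 encoding, and the function name merge_chunks are my own assumptions for illustration; only the column order comes from the list above.

```python
import csv

# Column order as listed above; the names themselves are illustrative labels.
FIELDS = ["user", "created_at", "tweet_id", "user_image", "status"]


def merge_chunks(paths):
    """Read several headerless CSV chunks and keep one row per tweet ID.

    The first copy of each tweet wins; rows with the wrong number of
    columns are skipped. UTF-8 is an assumption about the files.
    """
    seen = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                if len(row) != len(FIELDS):
                    continue  # malformed row, ignore it
                record = dict(zip(FIELDS, row))
                seen.setdefault(record["tweet_id"], record)
    return list(seen.values())
```

Keying on the tweet ID rather than the whole row means that two copies of the same tweet still collapse into one even if some other field differs between chunks.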
The Python script (warning: IANAP) used to munge this together is included, along with a .net file of the retweet network that can then be loaded and massaged in Pajek, and the script used to make that file.
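To give a rough idea of what building a retweet network involves, here is a sketch that pulls "RT @user" mentions out of tweet text and writes them in Pajek's .net layout. Again, this is not the included script. The regular expression, the lowercasing of names, the direction of the arcs (from the retweeter to the person retweeted), and the function names are all assumptions on my part.

```python
import re
from collections import Counter

# Matches "RT @name" in any letter case; a guess at the convention used.
RT_PATTERN = re.compile(r"\bRT\s*@(\w+)", re.IGNORECASE)


def retweet_edges(tweets):
    """Count (retweeter, retweeted) pairs from (user, status) tuples."""
    edges = Counter()
    for user, status in tweets:
        for target in RT_PATTERN.findall(status):
            edges[(user.lower(), target.lower())] += 1
    return edges


def to_pajek(edges):
    """Render weighted arcs as text in Pajek's .net format."""
    names = sorted({name for pair in edges for name in pair})
    index = {name: i for i, name in enumerate(names, start=1)}
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{index[name]} "{name}"' for name in names]
    lines.append("*Arcs")
    for (source, target), weight in sorted(edges.items()):
        lines.append(f"{index[source]} {index[target]} {weight}")
    return "\n".join(lines) + "\n"
```

The arc weight here is simply how many times one user retweeted another, which may or may not match how the included .net file was weighted.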
Anything in g20-tweets.csv that started with http:// was stored (using get-http.py), and wget was used to archive a copy of it. In total, 6,653 different URLs were extracted. Note that wget does not retrieve Flash video, so it’s likely this was lost. (There were claims in the tweets that YouTube was removing videos of police actions.) Other sites may have been unreachable. Finally, URL shorteners no doubt meant that the same site was posted under a range of different URLs.
g20tweetweb.zip [108 Mb]
Note that I haven’t really even looked at that archived material yet, so it’s as-is.
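If you want to redo the URL extraction described above, something like the following sketch would collect every whitespace-separated token that starts with http:// and drop duplicates. It is not get-http.py itself. The function name and the stripping of trailing punctuation are assumptions I have added for illustration.

```python
def extract_urls(statuses):
    """Return the sorted, de-duplicated http:// tokens found in tweet texts."""
    urls = set()
    for status in statuses:
        for token in status.split():
            if token.startswith("http://"):
                # Trailing punctuation is probably sentence text, not URL.
                urls.add(token.rstrip(".,;:!?)\"'"))
    return sorted(urls)
```

Written out one per line, a list like this can be handed to wget through its --input-file option. Note that this only removes exact duplicates, so two shortened links pointing at the same page still count as two URLs.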