Web Analysis – A Thaumaturgical Compendium (alex.halavais.net)

#g20 Tweets (6 October 2009)

I've created this blog entry mainly as a way of providing access to some files related to the work Maria Garrido & I have been doing on the Twitter conversation surrounding the G-20 meeting in Pittsburgh in September of 2009.

Briefly, our aim was to examine tweets that included the #g20 hashtag and to figure out how leadership structures may have emerged. However, the data may be useful to others as well.

If you make use of the materials you’ve found here, please cite this web page, including both Alex Halavais and Maria Garrido as authors.

Tweets and RT net

The core data is simply a collection of tweets gathered using The Archivist, based on a search for the hashtag #g20 from midnight, September 20, to midnight, September 29, 2009. The Archivist should be able to store this in both XML and CSV formats, but for some reason the CSV export did not always work. The files were also collected in overlapping chunks, to provide redundancy, and so needed to be merged without duplicates. This zip file includes that data:

g20.zip [1.8 Mb]

Note that the tweets in g20tweets.csv are not sorted in any way. The file also lacks a header line. The columns are:

User, date/time, tweet ID, user image, status
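If you want to read the file back in, something like the following sketch would do. The column labels are just my own names for the fields listed above, and the duplicate check simply mirrors the merging step described earlier; adjust as needed.

import csv

# Column labels are informal names for the fields listed above;
# the file itself has no header line.
FIELDS = ['user', 'datetime', 'tweet_id', 'user_image', 'status']

tweets = []
seen = set()
f = open('g20tweets.csv')
for row in csv.reader(f):
    if len(row) < len(FIELDS):
        continue                        # skip any malformed lines
    record = dict(zip(FIELDS, row))
    if record['tweet_id'] in seen:      # guard against duplicate tweets
        continue
    seen.add(record['tweet_id'])
    tweets.append(record)
f.close()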

The Python script (warning: IANAP) used to munge this together is included, along with a .net file of the re-tweet network, which can be loaded and massaged in Pajek, and the script used to make that file.
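The script in the archive is the canonical version, but as a rough sketch of the idea: pull the re-tweet edges out of the status text with a regular expression and write them in Pajek's .net format. This assumes the tweets list from the snippet above and treats any "RT @user" mention as an arc from the retweeter to the original poster.

import re

RT_PATTERN = re.compile(r'RT\s+@(\w+)', re.IGNORECASE)

names = {}                              # username -> Pajek vertex number
edges = {}                              # (retweeter, original) -> count

def vertex(name):
    name = name.lower()
    if name not in names:
        names[name] = len(names) + 1    # Pajek numbers vertices from 1
    return names[name]

for t in tweets:                        # 'tweets' as read in the snippet above
    for original in RT_PATTERN.findall(t['status']):
        arc = (vertex(t['user']), vertex(original))
        edges[arc] = edges.get(arc, 0) + 1

out = open('g20-rt.net', 'w')
out.write('*Vertices %d\n' % len(names))
for name, num in sorted(names.items(), key=lambda x: x[1]):
    out.write('%d "%s"\n' % (num, name))
out.write('*Arcs\n')
for (src, dst), weight in edges.items():
    out.write('%d %d %d\n' % (src, dst, weight))
out.close()

The third column on each arc line is read by Pajek as a weight; here it is simply the number of times one user re-tweeted another.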

Linked

Anything from g20tweets.csv that started with http:// was extracted (using get-http.py), and wget was used to archive a copy of it. In total, 6,653 different URLs were extracted. Note that wget does not retrieve Flash video, so it's likely that content was lost. (There were claims in the tweets that YouTube was removing videos of police actions.) Other sites may have been unreachable. Finally, URL shorteners no doubt meant that the same site was posted under a range of different URLs.
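get-http.py in the archive is what was actually used; the gist of it looks something like this. The regex and the wget options here are my guesses at a minimal version, and it again assumes the tweets list read in above.

import re
import subprocess

URL_PATTERN = re.compile(r'http://\S+')

urls = set()
for t in tweets:                        # 'tweets' as read in earlier
    urls.update(URL_PATTERN.findall(t['status']))

for url in sorted(urls):
    # -q: quiet, -P: directory to save into; only the URL itself is
    # fetched, not embedded media such as Flash video
    subprocess.call(['wget', '-q', '-P', 'g20web', url])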

g20tweetweb.zip [108 Mb]

Note that I haven't really looked at the archived material yet, so it is provided as-is.

Update 7/12/2012: Please note that I've removed the above links in case they clash with Twitter's newer terms of use around sharing Twitter data sets. The data were collected before any such changes occurred, but I also want to be responsive to privacy concerns. If you are a researcher interested in the data, please just contact me.

Update 4/12/2013: Congratulations to Jennifer Earl and crew on an interesting piece of research that incorporates these data: This protest will be tweeted: Twitter and protest policing during the Pittsburgh G20.

My Web Personality (23 August 2009)

[Image: MIT Personas visualization]
This is from an art installation that appeared at the MIT Museum. It grabs information from the web and classifies the keywords. I’m not at all sure how it thinks I’m a big sports fan–I can’t imagine what words I use that are “sporty”!–but as the write-up suggests, “It is meant for the viewer to reflect on our current and future world, where digital histories are as important if not more important than oral histories, and computational methods of condensing our digital traces are opaque and socially ignorant.”

Check out how the internet sees you.

[the making of, pt. 3] Creating a sample (22 September 2008)

(This is the third in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Having put together the bones of a literature review in part 2, I now need to collect some data. My initial aim was a round number: 2,000 users, selected at random. There isn’t any particular reason for that number, other than that it is small enough to handle but large enough to lend a bit of power when it comes to the stats. It turned out that I actually ended up having to sample far more (around 30,000) to get enough commenters, but I didn’t know that until after I started. Research is messy.

Programming… shudder

Broadly, I wanted to collect a sample of individual users’ histories on Digg. This is made much easier because Digg, like many web services today, provides an API, or “application programming interface”: a way of tapping into the databases that drive the system. The good news is that this makes it much easier to access the data if you know a little bit about programming. So the unfortunate first step is learning to program, just a little, or engaging in enough “social engineering” to get someone who can program to help you. If you’ve done anything advanced in SPSS, SAS, Excel, or the like, you have probably done some programming without knowing it, but even if you’ve never programmed before, taking this one step will help you way beyond this project. You don’t even have to be a good programmer (I’m not); you just need to be able to get the computer to do what you want it to. Really, programming is just a matter of being very explicit about what you want to do, and then putting that in a standardized format (a computer language).

To that end, the first step I took was to download a Python installer (for free) here. One of the nice things about Python is that it is easy to learn, but powerful enough for people to do just about anything they want to. For what follows, I assume a very basic understanding of Python. If you already have some programming experience, see Mark Pilgrim’s Dive Into Python. There are lots of intros out there if you are just getting started: peruse your local bookstore.

Generally, when a service offers an API, someone will put together a tool (a “wrapper”) that makes it easy to access in Python and other popular programming languages. Derek van Vliet has written such a package for the Digg API, and so I downloaded it (PyDigg), again, for free, and put it in my Python directory.

A sample

I needed a sample of 2,000 user names. Like many systems, Digg appears to associate an incremental user ID with the user name. So if you look at the code of my user page, you will find a snippet that reads:

<input type="hidden" id="userid" value="242530" />

suggesting that I am the 242,530th person to sign up for a Digg account. This is great: it means that if there were a way to get user pages by number, I could just keep feeding that in and end up with a sample. Unfortunately, I’m not sure that is possible (or at least I didn’t immediately see a way to do it through the API).
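(If you did want to go the other way and scrape the numeric ID off a profile page, a quick-and-dirty sketch would look something like the following. The URL pattern and the markup it matches are assumptions based on the snippet above, not anything guaranteed by Digg, and it bypasses the API entirely.)

import re
import urllib2

def get_userid(username):
    # Assumes profile pages live at digg.com/users/<name> and contain the
    # hidden "userid" input shown above; both are assumptions about the markup.
    html = urllib2.urlopen('http://digg.com/users/%s' % username).read()
    match = re.search(r'id="userid" value="(\d+)"', html)
    return int(match.group(1)) if match else None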

We also need to know how many users there are if we are going to sample them. Luckily, that’s pretty easy to find out here. As of the time of this posting, that number is just shy of three million. So, if we take random users from that list, we should end up with a workable sample. In practice, we may want to limit the sample to particular registration dates, or users that have a certain minimum of activity, but for now, we’ll just try for a straight sample of all registered users.

The script

The PyDigg site gives a short sample script showing how it works, and a glance through its code provides a good indication of the sorts of things it can provide for you. Here is the script I used to assemble a sample of 2,000 users, with some comments as we go:

from digg import *
from random import *
from time import *

We start by importing the three libraries needed for our task. The Digg library, obviously, gives us access to the Digg API, “random” gives us a function that produces random numbers, and we’ll use a function from “time” to slow down our requests to Digg so that we don’t hog the resource.

TOTALUSERS = 2902633
APPKEY = 'http%3A%2F%2Falex.halavais.net'
SAMPLESIZE = 2000
PAUSE = 60
FILENAME = 'sampleusers.txt'

We start out with some constants. We could automatically get the total number of users via the API, but really, since this is a one-off script, that’s not necessary: easy enough to do it as a human, looking at the XML returned. The APPKEY is an application key that tells Digg who you are. Usually, it is a page for the program that is running. In this case, it is just a link to my blog, assuming that Digg can track me down that way. Please don’t use that for your own program :). The SAMPLESIZE is how many users I want to collect, and the PAUSE is the number of seconds I want it to idle between calls. Digg doesn’t give a hard number here, just asks you to be polite. A full minute may be overkill, but since you are going to wait anyway, why not just let it run overnight? Finally, FILENAME is where you want all this stuff stored.

Under other conditions, you would probably set up something to read this from the command line, get it from an initialization file, or query the user for it. But since I am going to be the only one using the program, and only using it once, I just hard-code the constants.

print "starting..."
d = Digg(APPKEY)
thesample = []

We start out with some basics. As with all the “prints” in the script, the first line just lets us know what’s going on. The d = statement creates an instance of the Digg object, something that happens in any script that uses the PyDigg library, so that you can access the API. We also create an empty list that will hold our sample.

while len(thesample) < SAMPLESIZE:	
  cursor = int(random()*TOTALUSERS)	
  print "Getting data from offset ", cursor, '... ',
  users = d.getUsers(count=1, offset=cursor)

Here we begin the loop that makes up the program as a whole. As long as the length of our list (thesample)–which you will recall starts out at length zero–is less than the number we want (2000), we keep executing the instructions below. First we assign a random number from 0 to the total number of users to the variable “cursor”. Then, we use the “getUsers” method to pull one user record from that offset number. In practice, it would probably be good to have some error checking and exception handling here, but again–if we are only ever going to run this once or twice, that is probably overkill.

  if users[0] not in thesample:
    print "writing...",
    thesample.append(users[0])
    f = open(FILENAME,'a')
    f.write(users[0].name)
    f.write(',')
    if users[0].registered:
      f.write(users[0].registered)
    else:
      f.write('0') 
    f.write('\n')				# and go to the next line.
    f.close()	

OK, this chunk writes out the record we have snatched, assuming we don’t already have it (we don’t want repeats in the sample). You might reasonably ask why we don’t just write it all out at the end, since we’re keeping a list of users in “thesample.” We could do that, but if for some reason the power goes out overnight, or you hit an error, or your cat hits CTRL-C, you have lost all that data. So, since we are taking our time anyway, why not do complete writes each time?

If this user object isn’t in our list (thesample) we add it in. We then open up our file to append a new line. The getUsers method returns a list of users; in our case, a list that includes only one object, but a list nonetheless. So we are concerned with the “zero-th” item on that list, and the “name” portion of this, which we write out to the file. We follow that with a comma (all the better to read in later, and I am assuming that commas are not allowed in usernames), and if the object has it, the date of registration. Then we close the file up, to make sure that nothing gets lost in a buffer if there is an abnormal termination.

  print "pausing...",
  sleep(PAUSE)
  print len(thesample), "finished."

We take a short breather, and print out the length of our sample thus far, before going back up and starting the loop again, assuming that we haven’t yet met our goal.

print "done."

Just in case it isn’t obvious.

In practice, again, things were a bit messier, and you may find the above doesn’t work out exactly for you. For example, you probably want to set your default character set to Unicode, and do a little error checking. But in spirit, the above shows how easy it is to pull the data down.
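For instance, the getUsers call can be wrapped so that a transient failure just waits and retries instead of killing an overnight run. This is only a sketch: it assumes the imports and constants from the script above, and since I don’t know exactly what PyDigg raises on a failed request, it catches everything.

def get_users_safely(d, cursor, retries=5):
    # Try the API call a few times before giving up, so a network
    # hiccup doesn't end an overnight run.
    for attempt in range(retries):
        try:
            return d.getUsers(count=1, offset=cursor)
        except Exception, e:            # assumption: PyDigg doesn't define its own exceptions
            print "error:", e, "... waiting and retrying"
            sleep(PAUSE)
    return []                           # the main loop should skip an empty result

In the main loop, you would call this function in place of the bare d.getUsers line and check that the returned list is not empty before indexing users[0].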

I won’t go into the process of pulling the comments into a comma-separated file, since it is largely identical to the above process. If you want to do this, take a look at the objects in the Digg library, and see what sorts of things they return.

From here

Having collected a lot of data, we need to figure out how to work it into something usable. Next (in part 4), I will talk a little about ways of dealing with all of this data and munging it into formats that are useful.
