[the making of, pt. 3] Creating a sample

(This is the third in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Having put together the bones of a literature review in part 2, I now need to collect some data. My initial aim was a round number: 2,000 users, selected at random. There isn’t any particular reason for that number, other than that it is small enough to handle, but large enough to lend a bit of power when it comes to the stats. As it turned out, I had to sample far more (30K) to get enough commenters, but I didn’t know that until after I started. Research is messy.

Programming… shudder

Broadly, I wanted to collect a sample of individual users’ histories on Digg. This is made much easier because Digg, like many web services today, provides an API or “application programming interface,” a way of tapping into the databases that drive the system. The good news is that this makes it much easier to access the data if you know a little bit about programming. So, the unfortunate first step of this is learning to program, just a little, or engaging in enough “social engineering” to get someone who can program to help you. If you’ve done anything advanced in SPSS, SAS, Excel, or the like, you probably have done some programming without knowing it, but even if you’ve never done any programming before, taking this one step will help you well beyond this project. You don’t even have to be a good programmer (I’m not); you just need to be able to get the computer to do what you want it to. Really, programming is just a matter of being very explicit about what you want to do, and then putting that in a standardized format (computer language).

To that end, the first step I took was to download a Python installer (for free) here. One of the nice things about Python is that it is easy to learn, but powerful enough for people to do just about anything they want to. For what follows, I assume a very basic understanding of Python. If you already have some programming experience, see Mark Pilgrim’s Dive Into Python. There are lots of intros out there if you are just getting started: peruse your local bookstore.

Generally, when a service offers an API, someone will put together a tool (a “wrapper”) that makes it easy to access in Python and other popular programming languages. Derek van Vliet has written such a package for the Digg API, and so I downloaded it (PyDigg), again, for free, and put it in my Python directory.

A sample

I needed a sample of 2,000 user names. Like many systems, Digg appears to associate an incremental user ID with the user name. So if you look at the source code of my user page, you will find a snippet that reads:

<input type="hidden" id="userid" value="242530" />

suggesting that I am the 242,530th person to sign up for a Digg account. This is great: it means that if there were a way to get userpages by number, I could just keep feeding that in and end up with a sample. Unfortunately, I’m not sure that is possible (or at least I didn’t immediately see a way to do it through the API).
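If you wanted to pull that number out of a page automatically rather than by eye, something like the following would do it. This is only a sketch in Python 3 (not the Python 2 used in the script later in this post), built on the standard library's `html.parser`. The names `UserIdParser` and `extract_userid` are made up for illustration, and it assumes the page keeps using an `input` element with `id="userid"`.

```python
from html.parser import HTMLParser


class UserIdParser(HTMLParser):
    """Remembers the value of <input id="userid" ...> if the page has one."""

    def __init__(self):
        super().__init__()
        self.userid = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        value = attributes.get("value")
        if tag == "input" and attributes.get("id") == "userid":
            if value and value.isdigit():
                self.userid = int(value)


def extract_userid(page_source):
    """Return the numeric user ID found in page_source, or None."""
    parser = UserIdParser()
    parser.feed(page_source)
    return parser.userid


snippet = '<input type="hidden" id="userid" value="242530" />'
print(extract_userid(snippet))
```

Run on the snippet above, this prints 242530. It returns `None` when no such element is found, which is worth checking for before trusting the result.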

We also need to know how many users there are if we are going to sample them. Luckily, that’s pretty easy to find out here. As of the time of this posting, that number is just shy of three million. So, if we take random users from that list, we should end up with a workable sample. In practice, we may want to limit the sample to particular registration dates, or users that have a certain minimum of activity, but for now, we’ll just try for a straight sample of all registered users.
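The "random users from that list" idea can be sketched as drawing random positions between zero and the total user count. The version below is an alternative to the one-at-a-time approach in the script that follows: it draws all the positions up front, without repeats, using `random.sample`. It is written in Python 3, and `draw_offsets` is a hypothetical helper name, not part of any library.

```python
import random

TOTAL_USERS = 2902633  # the user count used in this post
SAMPLE_SIZE = 2000


def draw_offsets(total, size, seed=None):
    """Return `size` distinct random positions in the range 0..total-1."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(range(total), size)


offsets = draw_offsets(TOTAL_USERS, SAMPLE_SIZE, seed=1)
print(len(offsets), len(set(offsets)))
```

One trade-off: drawing everything up front guarantees no duplicate positions, but if some positions come back empty you end up with fewer users than planned, so a loop that keeps going until the target is met may still be the safer choice.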

The script

The PyDigg site gives a short sample script showing how it works, and a glance through its code provides a good indication of the sorts of things it can provide for you. Here is the script I used to assemble a sample of 2,000 users, with some comments as we go:

from digg import *
from random import *
from time import *

We start by opening up three libraries that are needed for our task. The Digg library, obviously, gives us access to the Digg API, “random” gives us a function that produces random numbers, and we’ll use a function from “time” to slow down our requests to Digg, and not hog the resource.

TOTALUSERS = 2902633
APPKEY = 'http%3A%2F%2Falex.halavais.net'
SAMPLESIZE = 2000
PAUSE = 60
FILENAME = 'sampleusers.txt'

We start out with some constants. We could automatically get the total number of users via the API, but really, since this is a one-off script, that’s not necessary: easy enough to do it as a human, looking at the XML returned. The APPKEY is an application key that tells Digg who you are. Usually, it is a page for the program that is running. In this case, it is just a link to my blog, assuming that Digg can track me down that way. Please don’t use that for your own program :). The SAMPLESIZE is how many users I want to collect, and the PAUSE is the number of seconds I want it to idle between calls. Digg doesn’t give a hard number here, just asks you to be polite. A full minute may be overkill, but since you are going to wait anyway, why not just let it run overnight? Finally, FILENAME is where you want all this stuff stored.

Under other conditions, you would probably set up something to read this from the command line, get it from an initialization file, or query the user for it. But since I am going to be the only one using the program, and only using it once, I just hard-code the constants.
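For the curious, reading those same values from the command line might look roughly like this. It is a sketch in Python 3 using the standard `argparse` module; the option names (`--total-users`, `--sample-size`, and so on) are invented here, and the defaults simply mirror the constants above. The application key is left out.

```python
import argparse


def build_parser():
    """Command-line options mirroring the hard-coded constants."""
    parser = argparse.ArgumentParser(
        description="Collect a random sample of users.")
    parser.add_argument("--total-users", type=int, default=2902633,
                        help="how many registered users exist")
    parser.add_argument("--sample-size", type=int, default=2000,
                        help="how many users to collect")
    parser.add_argument("--pause", type=float, default=60,
                        help="seconds to wait between requests")
    parser.add_argument("--filename", default="sampleusers.txt",
                        help="file the sample is appended to")
    return parser


# An explicit argument list is passed here so the example behaves the same
# wherever it runs; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["--sample-size", "500", "--pause", "30"])
print(args.sample_size, args.pause, args.filename)
```

Anything not given on the command line falls back to its default, so running with no options at all would reproduce the hard-coded setup.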

print "starting..."
d = Digg(APPKEY)
thesample = []

We start out with some basics. As with all the “prints” in the script, this one just lets us know what’s going on. The d= statement creates an instance of the Digg object; something that happens with any script that uses the PyDigg library so that you can access it. We also create an empty list of objects that will be our sample.

while len(thesample) < SAMPLESIZE:	
  cursor = int(random()*TOTALUSERS)	
  print "Getting data from offset ", cursor, '... ',
  users = d.getUsers(count=1, offset=cursor)

Here we begin the loop that makes up the program as a whole. As long as the length of our list (thesample)–which you will recall starts out at length zero–is less than the number we want (2000), we keep executing the instructions below. First we assign a random number from 0 to the total number of users to the variable “cursor”. Then, we use the “getUsers” method to pull one user record from that offset number. In practice, it would probably be good to have some error checking and exception handling here, but again–if we are only ever going to run this once or twice, that is probably overkill.
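As for the error checking mentioned above, here is one shape it could take: a small wrapper that retries a failed request a few times and copes with an empty result. It is a Python 3 sketch, not the script as it was run. `fetch_with_retry` and `fake_fetch` are hypothetical names, and since the exceptions PyDigg raises are not covered here, the wrapper assumes nothing about them and catches any `Exception`.

```python
import time


def fetch_with_retry(fetch, offset, attempts=3, wait=5, sleep=time.sleep):
    """Call fetch(offset) until it returns a non-empty list.

    Returns the first item of that list, or None if every attempt
    either raised an exception or came back empty.
    """
    for attempt in range(1, attempts + 1):
        try:
            users = fetch(offset)
            if users:
                return users[0]
        except Exception as error:  # broad on purpose: see the note above
            print("attempt", attempt, "failed:", error)
        if attempt < attempts:
            sleep(wait)
    return None


calls = []


def fake_fetch(offset):
    # Stand-in for the real API call: fails once, then succeeds.
    calls.append(offset)
    if len(calls) == 1:
        raise IOError("simulated network hiccup")
    return ["user-at-%d" % offset]


# The sleep is replaced with a no-op so the demonstration runs instantly.
print(fetch_with_retry(fake_fetch, 42, sleep=lambda seconds: None))
```

In the real script, `fetch` would be something like a small function wrapping `d.getUsers(count=1, offset=offset)`, and the loop would skip to the next random offset whenever `None` comes back.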

  if users[0] not in thesample:
    print "writing...",
    thesample.append(users[0])
    f = open(FILENAME,'a')
    f.write(users[0].name)
    f.write(',')
    if users[0].registered:
      f.write(users[0].registered)
    else:
      f.write('0') 
    f.write('\n')  # and go to the next line.
    f.close()	

OK, this chunk writes out the record we have snatched, assuming we don’t already have it (we don’t want repeats in the sample). You might reasonably ask why we don’t just write it all out at the end, since we’re keeping a list of users in “thesample.” We could do that, but if for some reason the power goes out overnight, or you hit an error, or your cat hits CTRL-C, you have lost all that data. So, since we are taking our time anyway, why not do complete writes each time?
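A side benefit of writing every record as you go is that a restarted run can read the file back and carry on where it left off. A minimal sketch of that, in Python 3, assuming the file holds one `name,registered` pair per line: `load_existing_names` is a hypothetical helper, and the user names in the demonstration are made up.

```python
import csv
import os
import tempfile


def load_existing_names(filename):
    """Return the set of user names already written to the sample file."""
    names = set()
    if not os.path.exists(filename):
        return names  # first run: nothing collected yet
    with open(filename, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if row and row[0]:
                names.add(row[0])
    return names


# Demonstration with a throwaway file and invented user names.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sampleusers.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("alice,1165000000\nbob,0\n")
    seen = load_existing_names(path)

print(sorted(seen))
```

Keeping a set of names also gives a cheap duplicate check that does not depend on how the library compares two user objects, which may or may not treat separately fetched copies of the same user as equal.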

If this user object isn’t in our list (thesample) we add it in. We then open up our file to append a new line. The getUsers method returns a list of users; in our case, a list that includes only one object, but a list nonetheless. So we are concerned with the “zero-th” item on that list, and the “name” portion of this, which we write out to the file. We follow that with a comma (all the better to read in later, and I am assuming that commas are not allowed in usernames), and if the object has it, the date of registration. Then we close the file up, to make sure that nothing gets lost in a buffer if there is an abnormal termination.
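The same append-and-close step can be sketched with the standard `csv` module, which removes the need to assume that user names never contain commas. This is Python 3, with a hypothetical `append_user` helper and invented names in the demonstration; opening the file as UTF-8 also takes care of non-ASCII user names.

```python
import csv
import os
import tempfile


def append_user(filename, name, registered):
    """Append one name,registered row, writing 0 when the date is missing."""
    # The with-block closes the file after every row, so nothing is left
    # sitting in a buffer if the program stops unexpectedly.
    with open(filename, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([name, registered or 0])


# Demonstration with a throwaway file and invented users.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sampleusers.txt")
    append_user(path, "example_user", 1165000000)
    append_user(path, "name,with,commas", None)
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

Reading the file back with `csv.reader` returns every field as a string, so the registration date needs converting to a number before doing any arithmetic with it.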

  print "pausing...",
  sleep(PAUSE)
  print len(thesample), "finished."

We take a short breather, and print out the length of our sample thus far, before going back up and starting the loop again, assuming that we haven’t yet met our goal.

print "done."

Just in case it isn’t obvious.

In practice, again, things were a bit messier, and you may find the above doesn’t work out exactly for you. For example, you probably want to set your default character set to Unicode, and do a little error checking. But in spirit, the above shows how easy it is to get stuff.

I won’t go into the process of pulling the comments into a comma-separated file, since it is largely identical to the above process. If you want to do this, take a look at the objects in the Digg library, and see what sorts of things they return.

From here

Having collected a lot of data, we need to figure out how to work it into something usable. Next (in part 4), I will talk a little bit about ways of dealing with all of this data and munging it into formats that are useful.
