[the making of, pt. 3] Creating a sample

(This is the third in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Having put together the bones of a literature review in part 2, I now need to collect some data. My initial aim was a round number: 2,000 users, selected at random. There isn’t any particular reason for that number, other than that it is small enough to handle, but large enough to lend a bit of power when it comes to the stats. It turned out that I ended up having to sample far more (30,000) to get enough commenters, but I didn’t know that until after I started. Research is messy.

Programming… shudder

Broadly, I wanted to collect a sample of individual users’ histories on Digg. This is made much easier because Digg, like many web services today, provides an API, or “application programming interface”: a way of tapping into the databases that drive the system. The good news is that the data are there for the taking if you know a little bit about programming. So, the unfortunate first step is learning to program, just a little, or engaging in enough “social engineering” to get someone who can program to help you. If you’ve done anything advanced in SPSS, SAS, Excel, or the like, you have probably done some programming without knowing it, but even if you’ve never programmed before, taking this one step will help you well beyond this project. You don’t even have to be a good programmer (I’m not); you just need to be able to get the computer to do what you want it to. Really, programming is just a matter of being very explicit about what you want to do, and then putting that in a standardized format (a computer language).

To that end, the first step I took was to download a Python installer (for free) here. One of the nice things about Python is that it is easy to learn, but powerful enough for people to do just about anything they want to. For what follows, I assume a very basic understanding of Python. If you already have some programming experience, see Mark Pilgrim’s Dive Into Python. There are lots of intros out there if you are just getting started: peruse your local bookstore.

Generally, when a service offers an API, someone will put together a tool (a “wrapper”) that makes it easy to access from Python and other popular programming languages. Derek van Vliet has written such a package for the Digg API, and so I downloaded it (PyDigg), again for free, and put it in my Python directory.

A sample

I needed a sample of 2,000 user names. Like many systems, Digg appears to associate an incremental user ID with the user name. So if you look at the source code of my user page, you will find a snippet that reads:

<input type="hidden" id="userid" value="242530" />

suggesting that I am the 242,530th person to sign up for a Digg account. This is great: it means that if there were a way to get userpages by number, I could just keep feeding that in and end up with a sample. Unfortunately, I’m not sure that is possible (or at least I didn’t immediately see a way to do it through the API).
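(Just to illustrate: if you wanted to check an ID by hand, a few lines of Python will pull it out of the page source. This is only a sketch, and both the URL pattern and the markup here are assumptions that will break whenever Digg redesigns:)

import urllib2, re

html = urllib2.urlopen('http://digg.com/users/USERNAME').read()	# USERNAME is a placeholder
m = re.search(r'id="userid" value="(\d+)"', html)
if m:
  print "user id:", m.group(1)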

We also need to know how many users there are if we are going to sample them. Luckily, that’s pretty easy to find out here. As of the time of this posting, that number is just shy of three million. So, if we take random users from that list, we should end up with a workable sample. In practice, we may want to limit the sample to particular registration dates, or users that have a certain minimum of activity, but for now, we’ll just try for a straight sample of all registered users.
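(If you would rather have the script fetch that total itself, here is a hedged sketch. I am assuming that a bare request to the users endpoint returns XML with a “total” attribute on the root element; check that against what Digg actually sends back:)

import urllib2
from xml.dom.minidom import parseString

url = 'http://services.digg.com/users?count=1&appkey=YOUR_APP_KEY'
req = urllib2.Request(url, headers={'User-Agent': 'sampling-script'})	# Digg asks that you identify yourself
doc = parseString(urllib2.urlopen(req).read())
print "total users:", doc.documentElement.getAttribute('total')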

The script

The PyDigg site gives a short sample script showing how it works, and a glance through its code provides a good indication of the sorts of things it can provide for you. Here is the script I used to assemble a sample of 2,000 users, with some comments as we go:

from digg import *
from random import *
from time import *

We start by opening up three libraries that are needed for our task. The Digg library, obviously, gives us access to the Digg API; “random” gives us a function that produces random numbers; and we’ll use a function from “time” to slow down our requests to Digg, so that we don’t hog the resource.

TOTALUSERS = 2902633
APPKEY = 'http%3A%2F%2Falex.halavais.net'
SAMPLESIZE = 2000
PAUSE = 60
FILENAME = 'sampleusers.txt'

We start out with some constants. We could automatically get the total number of users via the API, but really, since this is a one-off script, that’s not necessary: it is easy enough to do it as a human, looking at the XML returned. The APPKEY is an application key that tells Digg who you are. Usually, it is the URL of a page for the program that is running. In this case, it is just a link to my blog, on the assumption that Digg can track me down that way. Please don’t use that for your own program :). The SAMPLESIZE is how many users I want to collect, and PAUSE is the number of seconds the script should idle between calls. Digg doesn’t give a hard number here, just asks you to be polite. A full minute may be overkill, but since you are going to wait anyway, why not just let it run overnight? Finally, FILENAME is where you want all this stuff stored.

Under other conditions, you would probably set up something to read these from the command line, get them from an initialization file, or query the user for them. But since I am going to be the only one using the program, and only using it once, I just hard-code the constants.
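(For what it’s worth, here is a minimal sketch of the command-line version, using the standard optparse module that ships with Python; the option names are just my own invention:)

from optparse import OptionParser

parser = OptionParser()
parser.add_option('-s', '--samplesize', type='int', default=2000)
parser.add_option('-p', '--pause', type='int', default=60)
parser.add_option('-f', '--filename', default='sampleusers.txt')
(options, args) = parser.parse_args()	# e.g. python sample.py -s 500 -p 30

SAMPLESIZE = options.samplesize
PAUSE = options.pause
FILENAME = options.filename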

print "starting..."
d = Digg(APPKEY)
thesample = []

We start out with some basics. As with all the “prints” in the script, the first line just lets us know what’s going on. The d = statement creates an instance of the Digg object, something every script that uses the PyDigg library does in order to access the API. We also create an empty list that will hold our sample.

while len(thesample) < SAMPLESIZE:		# keep going until the sample is full
  cursor = int(random()*TOTALUSERS)		# pick a random offset into the user list
  print "Getting data from offset ", cursor, '... ',
  users = d.getUsers(count=1, offset=cursor)	# grab the single user record at that offset

Here we begin the loop that makes up the program as a whole. As long as the length of our list (thesample), which you will recall starts out at zero, is less than the number we want (2,000), we keep executing the instructions below. First, we assign a random whole number between 0 and the total number of users to the variable “cursor”. Then, we use the “getUsers” method to pull one user record from that offset. In practice, it would probably be good to have some error checking and exception handling here, but again, if we are only ever going to run this once or twice, that is probably overkill.
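(If you did want that safety net, a minimal sketch follows, standing in for the bare getUsers call above. Exactly which exceptions PyDigg raises is an assumption on my part, so I catch broadly here:)

  try:
    users = d.getUsers(count=1, offset=cursor)
  except Exception:			# network hiccup or API error: wait, then try a fresh offset
    print "request failed, retrying..."
    sleep(PAUSE)
    continue
  if not users:				# empty result: try another offset
    continue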

  if users[0] not in thesample:			# skip any user we already have
    print "writing...",
    thesample.append(users[0])			# add the new user to our sample
    f = open(FILENAME,'a')			# append, so earlier records survive a crash
    f.write(users[0].name)			# the username...
    f.write(',')				# ...a comma for easy parsing later...
    if users[0].registered:
      f.write(users[0].registered)		# ...and the registration date, if any
    else:
      f.write('0')
    f.write('\n')				# and go to the next line.
    f.close()					# close right away, so nothing sits in a buffer

OK, this chunk writes out the record we have snatched, assuming we don’t already have it (we don’t want repeats in the sample). You might reasonably ask why we don’t just write it all out at the end, since we’re keeping a list of users in “thesample.” We could do that, but if for some reason the power goes out overnight, or you hit an error, or your cat hits CTRL-C, you have lost all that data. So, since we are taking our time anyway, why not do complete writes each time?

If this user object isn’t already in our list (thesample), we add it in. We then open up our file to append a new line. The getUsers method returns a list of users; in our case, a list that includes only one object, but a list nonetheless. So we are concerned with the “zero-th” item on that list, and the “name” portion of it, which we write out to the file. We follow that with a comma (all the better to read it in later; I am assuming that commas are not allowed in usernames), and, if the object has it, the date of registration. Then we close the file up, to make sure that nothing gets lost in a buffer if there is an abnormal termination.
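(One caveat: whether “users[0] not in thesample” actually catches repeats depends on how PyDigg defines equality between user objects; if it falls back on object identity, every record will look new. A safer sketch, under that assumption, is to test on usernames, which we know are unique:)

seen = set()				# created once, before the while loop

  if users[0].name not in seen:		# inside the loop, in place of the list test
    seen.add(users[0].name)
    thesample.append(users[0])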

  print "pausing...",
  sleep(PAUSE)
  print len(thesample), "finished."

We take a short breather, and print out the length of our sample thus far, before going back up and starting the loop again, assuming that we haven’t yet met our goal.

print "done."

Just in case it isn’t obvious.

In practice, again, things were a bit messier, and you may find the above doesn’t work out exactly for you. For example, you probably want to set your default character set to unicode, and do a little error checking. But in spirit, the above shows how easy it is to get stuff.
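(On the unicode point: the simplest route is probably to swap the plain open for codecs.open from the standard library, so that a username with non-ASCII characters doesn’t kill the write. A minimal sketch:)

import codecs

f = codecs.open(FILENAME, 'a', encoding='utf-8')	# unicode-safe appends
f.write(users[0].name)
f.write(',')
f.close()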

I won’t go into the process of pulling the comments into a comma-separated file, since it is largely identical to the above process. If you want to do this, take a look at the objects in the Digg library, and see what sorts of things they return.
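(If you want a running start, the shape of that script looks something like the sketch below. Fair warning: getUserComments and the content attribute are my guesses, purely hypothetical; open up the PyDigg source and see what the library actually calls them:)

from digg import *
from time import *

APPKEY = 'YOUR_APP_KEY'			# use your own application key here
PAUSE = 60

d = Digg(APPKEY)
for line in open('sampleusers.txt'):
  name = line.strip().split(',')[0]	# the username is the first field
  comments = d.getUserComments(name)	# hypothetical method name; check digg.py for the real one
  out = open('samplecomments.txt', 'a')
  for c in comments:
    out.write(name + ',' + c.content.replace(',', ' ') + '\n')	# 'content' is an assumed attribute
  out.close()
  sleep(PAUSE)				# same politeness pause as before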

From here

Having collected a lot of data, we need to figure out how to work it into something usable. Next (in part 4), I will talk a little bit of ways of dealing with all of this data and munging it into formats that are useful.


Grace Jones

Still weird after all these years, and thank goodness for that.


[the making of, pt. 2] Assembling the lit review

(This is the second in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Overlapping sheets, not a point

A common response I get back from undergrads undertaking their first literature review is that they cannot find anything. This usually signals a problem with the search rather than the literature, since the more common problem is having too much to cover in a lit review, not too little. Generally, the complaint comes in the form of “I can’t find anything on advances in the pizza delivery business in Peru during the 1990s.” If you could find something exactly on point for your work, you might have a problem: after all, what are you adding to the conversation?

While you should definitely look for “near misses” for your own narrow research topic, generally you are going to have to stretch out a little broader: any papers or books on pizza delivery are probably worth checking out, whether or not they are set in Peru, related to business, or from the 1990s. Likewise, you are probably interested in the food delivery business in Peru more broadly as well. In other words, by assembling a quilt made up of existing slips of material, you can shape your own literature. Your work should be the point that binds together otherwise relatively disconnected pieces of work. Some of those pieces may not fit together neatly; others will be a closer match.

Keeping track

Generally, I keep a few documents going, at least when I am organized enough to do so. The main document keeps track of things I’ve found. This includes a full citation, and either relevant quotes (clearly indicated with quotation marks so that I don’t make mistakes later) or summaries of the material. (Many people keep these on notecards, or the digital equivalent, but I have always just kept it all in a document.) A second document records the citations and other information for things I think I should take a look at. A third document includes a list of key search terms and authors that might help in searching for materials.

I generally start with the last list, looking for a set of keywords, works, authors, or phrases that I can use to locate more information. I get these from my own assumptions, or sometimes from places like Wikipedia and other general sources of information. I then use these keywords in a series of places. I generally begin on Google Scholar these days. I might also make use of ComAbstracts and similar resources. From these, I end up with a set of citations for things I should check out. For articles, I can generally get to them electronically from home. I am also likely to check Google Books to see whether I really need a particular book before making the trek to the library.

As I go through these, I add to the search bibliography and search terms documents. Often, an article will contain little of interest other than its references to other literature. It is a very iterative process. Eventually, I feel like I am seeing the same citations and names consistently, and have a handle on that particular segment of the literature.

Finally, for larger projects I try to assemble the bibliographic resources in a citation manager. I used EndNote for many years, with varying degrees of success. I have now switched over to Zotero, and am finding it useful. Zotero is a free plug-in for Firefox that serves as a bibliographic manager, and provides a nice way to organize your resources and notes on the things you are citing. No matter your approach, make sure you have complete citations organized in some electronic format. Cleaning up citations and making certain the bibliography is correct consistently takes me three or four times as long as I expect it to.

As a practical matter, the above (like the cake) is a lie. I have generally internalized those three documents, and write the lit review as I am collecting information and sources. (Now, if only I could–like Jeremy Hilary Boob–write a positive review of my article as I was composing it!) If, however, you don’t have a lot of experience writing this sort of stuff, a bit of structure can’t hurt, and the “three document” approach has worked for students in the past.

Useful Areas

So what sort of things am I looking for? Obviously, I should at least look at anything that touches on Digg and the culture surrounding it and other collaborative filtering sites. There is surprisingly little literature, I suspect, that deals with the current crop of collaborative filters that involve interaction (Digg, Reddit, StumbleUpon, flock, and several dozen others). Much of what is out there will be technical in nature, and less likely to go into much depth in terms of users or social issues.

I also want to lay a foundation in operant conditioning, particularly as it applies to user interfaces and behaviors within a discourse community. It’s been a long time since I encountered this stuff in intro psych, and I’m not entirely sure where I’m going to find material here, or if I will. I suspect that since creating motivation to participate is particularly important in the area of education, there may be some work in that literature that looks particularly at reward systems and how they affect behavior.

Clearly, some of this has to do with the question of how people join groups and become acculturated to them. Although it is rare that such processes have such an obvious marker of success, it may be that there are models that can be applied to the processes observed on Digg.

There may be some work on reputation systems that is applicable. In particular, systems like those found on eBay (where you can be rated up or down by those with whom you have interactions) may apply to this work as well.

Finally, I’m hoping I can mine some of my own material. In fact, I’ll probably start there. I presented a paper about Digg and elections at the National Communication Association meeting last year, there is a paper I worked on with Alex Tan that analyzed the structure of eBay ratings, and, as I mentioned, I may be able to draw on a chapter from my dissertation.

Reputation Systems

Since I’m treating this a bit like a cooking show, I won’t go through the entire process, but I’ll walk through one of those pieces that will be woven into a literature review: the literature on reputation systems. As I noted, and to the consternation of traditionalists and my own surprise, Wikipedia has become a useful part of my research process, so I head over there for a survey.

Although the entry itself doesn’t go much beyond a definition, it does link to a few articles, including one I vaguely remember reading in CACM, and, more importantly, to a treasure cache: this site, which includes a bibliography of relevant research papers. Yay! It is always nice when someone does a lot of the work for you. I go about assembling these, looking for information in them that informs my own work, and looking through their bibliographies to find common ancestors and theoretical foundations that are shared.

I also assemble a set of key phrases that seem to be coming up and run them through Google Scholar. Google Scholar is nice because it provides the ability to easily search for citing documents, and move forward to the most recent literature. In all, reading and writing about reputation systems takes only an hour or two, and yields about a paragraph of my literature review.

As with writing generally, if you are stuck, write something. Extract some reference, any reference, and start tunneling into it. Generally being stuck means you are not moving, and the easiest way to start moving is simply to start moving–in any direction at all.

And Everything Else

I generally find that I need to revisit the literature review after the research itself is done, but after working through the areas above and finding out how they connect, I’ve managed to set the foundation of my research, show that it hasn’t already been done (a common issue), and show that the questions I am asking are interesting and important, and will help to build on the existing literature. I ended up organizing the brief review under three headings:

* Filtering People and Content, where I talk about reputation systems and collaborative filtering
* Digg, where I talk more about how the site works
* Responses to Evaluation, where I look a bit at forms of learning and the process of becoming a member of a community.

From here, I have to figure out how it is I am going to do what I am going to do. Generally, this means planning out a method in some detail. The reasons for this are varied, but particularly in research that is collaborative, you will find that you need to propose the research before you can move forward. This may be a proposal to your committee, if you are a graduate student, or to a funding agency. If you are making use of human subjects, you will have to describe to a human subjects board what you plan to do in order to gain their approval.

Here, I know what I need (the public data from a sample of Diggers, organized in such a way that I can make sense of it), and so it is a matter of muddling through. In the next installment, I deal with getting an initial sample.

[Continue on to part 3…]


Are you someone?
