Archive for the 'General' Category
IR 9.0 Official tag: ir9
Monday, October 6th, 2008
The Internet Research 9.0 conference begins in Copenhagen next week. For those attending (or interested in following from afar), please use the tag “ir9” everywhere tagging is allowed: Flickr, Del.icio.us, your blog, walls, people, etc.
[the making of, pt. 6] Are you experienced?
Thursday, October 2nd, 2008
This is the sixth in a series of posts about the piece of research I am doing on Digg. You can read it from the beginning if you are interested. In the last section I showed a correlation between how much of a response people got from their comments and their propensity to contribute future comments to the community. In this section, I question whether we can observe some form of “learning” or “training” over time among Digg commenters. Do they figure out how to garner more Diggs, either by learning on an individual basis, or by attrition?
Are later comments better Dugg?
You will recall that we have numbered the comments for each user in our sample from the earliest to the most recent. If people are learning to become more acceptable to the community, we should see a significant difference in responses (Diggs and replies) between people’s first posts and their 25th posts.
Loading all the data into R, I find a fairly strong correlation between post number and upward diggs (.28), as well as with downward diggs (.11), and replies (.08). I’d like to show this as a boxplot, so you can clearly see the growing abilities of users, but R is giving me problems. The issue is simple enough: although I can “turn off” the plotting of outliers outside of the boxes, it still makes the size of the chart based on these. Since one of the comments received over 1,500 diggs up, it means my boxes (which have median values in the twos and threes) are sitting at the bottom of the graph as little lines. After a little digging in the help file, I figure out how to assign limits to the y axis (with a ylim=c(0,10)), and I generate the figure seen to the right.
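For readers who would rather stay in Python than R, here is a minimal sketch of the kind of correlation reported above. It is not the analysis I ran: the function name pearson and the two short lists are made up for illustration, and it is written for Python 3.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration data, NOT the Digg sample.
post_number = [1, 2, 3, 4, 5, 6]
diggs_up = [0, 2, 1, 3, 2, 5]
r = pearson(post_number, diggs_up)
print("correlation between post number and diggs: %.2f" % r)
```

With the real data you would pass in the comment number and the digg count for every comment in the sample.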
But this raises the question of what creates this increase. Like failing public high schools, some of the rise in positive marks might just be because the less capable Diggers are dropping out. We need to figure out if this is messing with our results.
Dropping out the Dropouts
In order to filter out the dropouts, I turn to… nope, not Python this time. I could, but it’s just as easy to sort all the comments in Excel by order, so that all the 30th comments are in one place on the worksheet. I then copy and paste these 812 usernames to a new sheet. In the last column of the main sheet, I write up a function that says, if the username is on that list, and if the number of this comment is 30th or less, print a 1 in this column; otherwise, leave the cell blank. If you are curious what that function looks like precisely, it’s this:
=IF(I178345<30,IF(ISNA(VLOOKUP(D178345,have30!$A$1:$B$812,1,FALSE)),"",1),"")
I can now sort the list by this new column, and I have all the first 30 comments, by users who have made at least 30 comments, in one place. I pull these into R and rerun the correlations. It turns out that–no surprise–they are reduced. The correlations to buries and responses are near zero, and to diggs are at 0.19.
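Had I reached for Python instead, the same filter might look roughly like the sketch below. The tuple layout (username, comment number, diggs) and the function name first_n_by_regulars are assumptions made for this example, not the actual structure of my worksheet, and the code is Python 3.

```python
from collections import Counter

def first_n_by_regulars(comments, n=30):
    """Keep comments numbered 1..n from users who posted at least n comments.

    `comments` is a list of (username, comment_number, diggs) tuples, where
    comment_number counts from 1 for each user's earliest comment. That
    layout is an assumption for this sketch.
    """
    # How many comments does each user have in the sample?
    counts = Counter(user for user, _, _ in comments)
    regulars = {user for user, total in counts.items() if total >= n}
    return [c for c in comments if c[0] in regulars and c[1] <= n]

# Tiny made-up example with n=3 rather than 30.
demo = [("a", 1, 2), ("a", 2, 0), ("a", 3, 5), ("a", 4, 1),
        ("b", 1, 3), ("b", 2, 1)]
kept = first_n_by_regulars(demo, n=3)
print(kept)
```

User "b" never reaches three comments and is dropped entirely, while user "a" keeps only the first three.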
I’m actually pretty happy with a 0.19 correlation. It means that there is something going on. But I’m curious as to what reviewers will think. The idea of a strong correlation is a bit arbitrary: it depends on what you are doing. If I designed a program that, over a six month period, correlated at -0.19 with body weight, or crime rates, or whatever, it would be really important. The open question is whether there are other stable factors that can explain this, or if the rest of the variability is due to, say, the fact that humans are not ants and tend to do unpredictable stuff now and again. Obviously, this cries out for some form of factor analysis, but I’m not sure how many of the other factors are measurable, or what they might be.
Hidden in these numbers, I suspected, were trolls: experienced users who were seeking out the dark side, learning to be more and more execrable during their first 30 comments. I wanted to get at the average scores of these folks, so I used the “subtotal” function in Excel (which can give you “subaverages” as well), and did some copying, pasting, and sorting to be able to identify the extreme ends. The average average was a score of about 3. The most “successful” poster managed to get an average score of over 33. She had started out with a bit of a bumpy ride. In fact, the first 24 posts had an average score of less than zero. But she cracked the code by the 25th comment, and received scores consistently in the hundreds for the last five of this chunk of data.
On the other end was someone who had an average score of -11. Among the first thirty entries, only one rose above zero, and the rest got progressively worse ratings, employing a litany of racist and sexist slurs, along with attacks on other sacred cows on Digg. It may have been she was just after the negative attention, and not paying any mind to the quantification of that in the form of a Digg score, but it’s clear that the objective was not to fit in.
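The per-user averages I pulled out with Excel’s subtotal function can also be sketched in a few lines of Python 3. The (username, score) pairs below are invented for illustration, and average_scores is a hypothetical helper name.

```python
from collections import defaultdict

def average_scores(comments):
    """Return {username: mean score} for a list of (username, score) pairs."""
    totals = defaultdict(lambda: [0, 0])  # username -> [sum of scores, count]
    for user, score in comments:
        totals[user][0] += score
        totals[user][1] += 1
    return {user: s / c for user, (s, c) in totals.items()}

# Made-up scores, NOT the Digg sample.
demo_scores = [("ann", 4), ("ann", 2), ("bob", -10), ("bob", -12), ("cy", 1)]
averages = average_scores(demo_scores)

# Sort by average to find the extreme ends of the distribution.
ranked = sorted(averages.items(), key=lambda item: item[1])
lowest, highest = ranked[0], ranked[-1]
print("lowest:", lowest, "highest:", highest)
```

Sorting the averages puts the most buried commenter first and the most Dugg one last, which is all the copying and pasting was really doing.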
Enough with the numbers!
I wanted to balance out the questions of timing and learning with at least an initial look at content. I always like to use mixed methods, even though it tends to make things harder to publish. At some point I really need to learn the lesson of Least Publishable Units, and split my work into multiple papers, but I’m not disciplined enough to do that yet. So, in the next sections I take on the question of what kinds of content seem to affect ratings.
Does (American-style) Democracy Work?
Wednesday, September 24th, 2008
“If they lie about us, then we will correct the record,” Obama said. “But this election is too important, too serious, to be playing silly games.”
The media has been all about the campaign lies this year. Factcheck.org, a site with an august history, is probably getting more hits this month than it ever has. Heard an interesting interview on Talk of the Nation yesterday about the legacy of Lee Atwater, the success of the Willie Horton ad, and its effect on Republican campaigning. The argument was that Republicans have no problem playing to the base instincts of the populace in order to win an election: all’s fair in love and war. Democrats have found no way to battle against this, and at least traditionally have not engaged in the same kinds of behavior (coded race-baiting, fear tactics, wrapping yourself in the flag, etc.).
I had been tracking on this and was deeply gratified that this election (I thought) would be different. I saw no way that McCain would turn to these tactics, after the way they had been used against him in the past by fellow Republicans. I’m still a little shocked at how wrong I was. Both campaigns have stretched the truth, but that McCain continues to claim–not just in ads but in interviews long after it has been shown by nonpartisan groups to be an outright lie–that Obama is seeking to raise taxes for most people, or that he wanted sex ed for kindergarteners, is way beyond the pale. He knows this, but has subscribed to the belief that it’s OK to lie, as long as you win the election, a bargain that has worked well for the Republicans in the past three decades. It is perhaps most ironic because of his “I’d rather lose the election” talk. Obviously, he’s willing to lie about the issues in order to win.
I’m not unsympathetic. After all, how do you convince the 95% of the voting population that will be helped by Obama’s tax cuts that they should instead pay more so that the top 5% can pay less? The richest 1% in America are now paying way less tax than they did in the 70s, while the middle class is paying way more. The US has a ridiculous (and preposterously trending) Gini coefficient. Really, there are two choices: either lie now about your own policy (“Read my lips”) or tell the truth about your policies, and lie about your opponent’s.
I talked to a colleague who was convinced that the economic situation means that Obama can’t lose. I’m not so sure. The Republicans have a playbook that works, that plays on the fears of the undecided voters, who tend to be–sorry–pretty uninformed and lacking common sense. They also have this odd schizophrenia in their appeal. A candidate who claims you aren’t really rich until you are making $5 million a year, and then suggests that Democrats are elitist, seems to be running on a campaign of “rationality is outdated.” And maybe it is.
But they do manage to collect a very large group of people who tend to be less educated than Obama supporters, and less wealthy. In some ways, I am the typical Obama supporter: a professor with an income well above the national average. (Professors disproportionately support Obama.) Some of the things that educated folks like about Obama–that he is able to articulate clear policy positions rather than playing to fears and “values”–may end up losing him the race. Undecideds don’t care about issues, they want to see a fight, and the bloodier the better.
Democrats are beginning to oblige, much to my disappointment. In recent television ads, Obama is suggesting that if McCain were president, seniors would have lost their social security checks in the current Wall Street meltdown. This goes beyond stretching the truth–it’s just plain wrong. Future retirees may have lost a significant amount of their benefits, but that’s not what the ad says. It preys on the fears of a demographic Obama desperately needs to win. And what I thought was looking like a Swiftboating effort looks to be taking hold, as does a call to release McCain’s medical records. (The latter is a borderline issue. I think most voters can handicap McCain’s likely survival rate, as the oldest president, who has already suffered medical issues, and is exhibiting some dementia already, without detailed medical evidence that wouldn’t add much to the discussion anyway.)
On one hand, the idea that McCain/Palin would come into the White House based on a campaign of lies, fear, and doubt is extraordinarily depressing. If the Bush elections were not a clear enough indication that our campaigning system is broken, a McCain win would be. I am not one of those people who says “I’m moving to Canada.” Or, rather, I am, but that’s just because Canada seems like a nice place to live sometimes. However, if McCain wins, I will be disappointed not just in the outcome, but in the process that has allowed Americans to vote against their own interest, and the interests of the country. If McCain won because people believed in his policies, I would have far less of a problem–compromise is at the heart of a democracy. But to win on the basis of deception dishonors not only his own legacy, but the country he seeks to lead.
Now it seems that some in Obama’s camp are willing to jump down that slope with McCain, hitting back with untrue ads, even as Obama says he wants to stay above the fray. If Obama wins by misleading the public, it will likewise shake my faith in our democracy. There is enough to attack McCain on that is true; there is no need to willfully misconstrue his remarks (as he has done with Obama: pigs and lipstick–no one ever said Americans were not gullible), or make up policies. Just tell Americans the truth: that McCain wants to outlaw abortion and gay marriage, he won’t negotiate with enemies like Iran, he wants working class Americans to pay more tax so that the rich and corporations can pay less, he is uninformed on the economy, doesn’t know who is in charge in Spain, or that there is no Czechoslovakia, or that Iran isn’t training al Qaeda, he is self-made only in that he married into wealth, he, like Bush, is proud of his lack of intellectual pursuit–all of these things are true, and still speak to his inability to lead. Heck, clips of McCain himself saying tremendously stupid things are probably the best attack ad. No need to do voiceovers or text: just show Americans what he has in store for them.
Or better yet, suck it up, correct the lies, and tell the American people what you are going to do to help get us out of this mess. The ideal: that Obama is elected, and manages to do so without giving in to the mud-slinging that both candidates said they would avoid.
[the making of, pt. 3] Creating a sample
Monday, September 22nd, 2008
(This is the third in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)
Having put together the bones of a literature review in part 2, I now need to collect some data. My initial aim was a round number: 2,000 users, selected at random. There isn’t any particular reason for that number, other than it is small enough to handle, but large enough to lend a bit of power when it comes to the stats. It turned out that I actually ended up having to sample far more (30K), to get enough commenters, but didn’t know that until after I started. Research is messy.
Programming… shudder
Broadly, I wanted to collect a sample of individual users’ histories on Digg. This is made much easier because Digg, like many web services today, provides an API or “application programming interface,” a way of tapping into the databases that drive the system. The good news is that this makes it much easier to access the data if you know a little bit about programming. So, the unfortunate first step of this is learning to program, just a little, or engaging in enough “social engineering” to get someone who can program to help you. If you’ve done anything advanced in SPSS, SAS, Excel, or the like, you probably have done some programming without knowing it, but even if you’ve never done any programming before, taking this one step will help you way beyond this project. You don’t even have to be a good programmer (I’m not), you just need to be able to get the computer to do what you want it to. Really, programming is just a matter of being very explicit about what you want to do, and then putting that in a standardized format (computer language).
To that end, the first step I took was to download a Python installer (for free) here. One of the nice things about Python is that it is easy to learn, but powerful enough for people to do just about anything they want to. For what follows, I assume a very basic understanding of Python. If you already have some programming experience, see Mark Pilgrim’s Dive Into Python. There are lots of intros out there if you are just getting started: peruse your local bookstore.
Generally, when a service offers an API, someone will put together a tool (a “wrapper”) that makes it easy to access in Python and other popular programming languages. Derek van Vliet has written such a package for the Digg API, and so I downloaded it (PyDigg), again, for free, and put it in my Python directory.
A sample
I needed a sample of 2,000 user names. Like many systems, Digg appears to associate an incremental user ID with the user name. So if you look at the code of my user page, you will find a snippet of code that reads:
<input type="hidden" id="userid" value="242530" />
suggesting that I am the 242,530th person to sign up for a Digg account. This is great: it means that if there were a way to get userpages by number, I could just keep feeding that in and end up with a sample. Unfortunately, I’m not sure that is possible (or at least I didn’t immediately see a way to do it through the API).
We also need to know how many users there are if we are going to sample them. Luckily, that’s pretty easy to find out here. As of the time of this posting, that number is just shy of three million. So, if we take random users from that list, we should end up with a workable sample. In practice, we may want to limit the sample to particular registration dates, or users that have a certain minimum of activity, but for now, we’ll just try for a straight sample of all registered users.
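One way to draw such a straight sample, sketched here in Python 3 rather than taken from my own script, is to pick all the random offsets up front so that no offset can come up twice. The function name draw_offsets and the seed are assumptions for illustration; the total is the user count I use in the script in the next section.

```python
import random

TOTAL_USERS = 2902633   # registered users at the time of writing
SAMPLE_SIZE = 2000

def draw_offsets(total, k, seed=None):
    """Return k distinct random offsets in the range 0..total-1."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(range(total), k)

offsets = draw_offsets(TOTAL_USERS, SAMPLE_SIZE, seed=42)
print(len(offsets), "offsets drawn; first five:", offsets[:5])
```

Sampling without replacement like this removes the need to check for repeats later, at the cost of deciding the sample size before any requests are made.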
The script
The PyDigg site gives a short sample script showing how it works, and a glance through its code provides a good indication of the sorts of things it can provide for you. Here is the script I used to assemble a sample of 2,000 users, with some comments as we go:
from digg import *
from random import *
from time import *
We start by opening up three libraries that are needed for our task. The Digg library, obviously, gives us access to the Digg API, “random” gives us a function that produces random numbers, and we’ll use a function from “time” to slow down our requests to Digg, and not hog the resource.
TOTALUSERS = 2902633
APPKEY = 'http%3A%2F%2Falex.halavais.net'
SAMPLESIZE = 2000
PAUSE = 60
FILENAME = 'sampleusers.txt'
We start out with some constants. We could automatically get the total number of users via the API, but really, since this is a one-off script, that’s not necessary: easy enough to do it as a human, looking at the XML returned. The APPKEY is an application key that tells Digg who you are. Usually, it is a page for the program that is running. In this case, it is just a link to my blog, assuming that Digg can track me down that way. Please don’t use that for your own program :). The SAMPLESIZE is how many users I want to collect, and the PAUSE is the number of seconds I want it to idle between calls. Digg doesn’t give a hard number here, just asks you to be polite. A full minute may be overkill, but since you are going to wait anyway, why not just let it run overnight? Finally, FILENAME is where you want all this stuff stored.
Under other conditions, you would probably set up something to read this from the command line, get it from an initialization file, or query the user for it. But since I am going to be the only one using the program, and only using it once, I just hard code in the constants.
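If you did want the command-line route, a minimal sketch using Python’s standard argparse module might look like the following. The option names are my own invention for this example, the defaults simply mirror the constants above, and the code is Python 3.

```python
import argparse

def parse_args(argv=None):
    """Read the sampling settings from the command line, with defaults."""
    parser = argparse.ArgumentParser(
        description="Collect a random sample of users.")
    parser.add_argument("--total-users", type=int, default=2902633,
                        help="how many registered users there are")
    parser.add_argument("--sample-size", type=int, default=2000,
                        help="how many users to collect")
    parser.add_argument("--pause", type=float, default=60,
                        help="seconds to idle between requests")
    parser.add_argument("--filename", default="sampleusers.txt",
                        help="where to append the results")
    return parser.parse_args(argv)

# Passing a list stands in for what would normally come from the shell.
args = parse_args(["--sample-size", "500", "--pause", "30"])
print(args.sample_size, args.pause, args.filename)
```

Calling parse_args() with no argument reads the real command line instead of the stand-in list.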
print "starting..."
d = Digg(APPKEY)
thesample = []
We start out with some basics. As with all the “prints” in the script, this one just lets us know what’s going on. The d= statement creates an instance of the Digg object; something that happens with any script that uses the PyDigg library so that you can access it. We also create an empty list of objects that will be our sample.
while len(thesample) < SAMPLESIZE:
    cursor = int(random()*TOTALUSERS)
    print "Getting data from offset ", cursor, '... ',
    users = d.getUsers(count=1, offset=cursor)
Here we begin the loop that makes up the program as a whole. As long as the length of our list (thesample)–which you will recall starts out at length zero–is less than the number we want (2000), we keep executing the instructions below. First we assign a random number from 0 to the total number of users to the variable “cursor”. Then, we use the “getUsers” method to pull one user record from that offset number. In practice, it would probably be good to have some error checking and exception handling here, but again–if we are only ever going to run this once or twice, that is probably overkill.
    if users[0] not in thesample:
        print "writing...",
        thesample.append(users[0])
        f = open(FILENAME,'a')
        f.write(users[0].name)
        f.write(',')
        if users[0].registered:
            f.write(users[0].registered)
        else:
            f.write('0')
        f.write('\n') # and go to the next line.
        f.close()
OK, this chunk writes out the record we have snatched, assuming we don’t already have it (we don’t want repeats in the sample). You might reasonably ask why we don’t just write it all out at the end, since we’re keeping a list of users in “thesample.” We could do that, but if for some reason the power goes out overnight, or you hit an error, or your cat hits CTRL-C, you have lost all that data. So, since we are taking our time anyway, why not do complete writes each time?
If this user object isn’t in our list (thesample) we add it in. We then open up our file to append a new line. The getUsers method returns a list of users; in our case, a list that includes only one object, but a list nonetheless. So we are concerned with the “zero-th” item on that list, and the “name” portion of this, which we write out to the file. We follow that with a comma (all the better to read in later, and I am assuming that commas are not allowed in usernames), and if the object has it, the date of registration. Then we close the file up, to make sure that nothing gets lost in a buffer if there is an abnormal termination.
    print "pausing...",
    sleep(PAUSE)
    print len(thesample), "finished."
We take a short breather, and print out the length of our sample thus far, before going back up and starting the loop again, assuming that we haven’t yet met our goal.
print "done."
Just in case it isn’t obvious.
In practice, again, things were a bit messier, and you may find the above doesn’t work out exactly for you. For example, you probably want to set your default character set to unicode, and do a little error checking. But in spirit, the above shows how easy it is to get stuff.
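To give a sense of what that error checking could look like, here is a hedged sketch of the same loop in Python 3. It is not the script I ran. The fetch_user argument is a hypothetical stand-in for whatever actually talks to the API: it is assumed to take an offset and return a (name, registered) pair, and to raise an exception when a request fails. The max_failures and sleep parameters are likewise my own additions for the example.

```python
import random
import time

def collect_sample(fetch_user, total_users, sample_size, filename,
                   pause=60, max_failures=5, sleep=time.sleep):
    """Collect `sample_size` distinct users, appending each one to `filename`.

    `fetch_user(offset)` is a hypothetical callable returning a
    (name, registered) pair; it is not part of any real library.
    """
    seen = set()
    failures = 0  # consecutive failed requests
    while len(seen) < sample_size:
        offset = random.randrange(total_users)
        try:
            name, registered = fetch_user(offset)
        except Exception as err:  # a real script would catch narrower errors
            failures += 1
            if failures >= max_failures:
                raise  # give up rather than loop forever
            print("request failed (%s), retrying" % err)
            sleep(pause)
            continue
        failures = 0
        if name not in seen:
            seen.add(name)
            # Write each record as soon as we have it, encoded as UTF-8,
            # so an interrupted run loses nothing already collected.
            with open(filename, "a", encoding="utf-8") as f:
                f.write("%s,%s\n" % (name, registered or 0))
        sleep(pause)
    return seen
```

Tracking user names in a set, rather than comparing whole user objects, is a design choice for this sketch: it makes the repeat check independent of how the library decides two objects are equal.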
I won’t go into the process of pulling the comments into a comma-separated file, since it is largely identical to the above process. If you want to do this, take a look at the objects in the Digg library, and see what sorts of things they return.
From here
Having collected a lot of data, we need to figure out how to work it into something usable. Next (in part 4), I will talk a little bit about ways of dealing with all of this data and munging it into formats that are useful.
Grace Jones
Monday, September 22nd, 2008
Still weird after all these years, and thank goodness for that.