Search Engine Society on Amazon

Via Twitter:

cshirky: Twitter office pool: How many books will come out in 2008 with ‘Google’ in the title? Round your answer to nearest dozen…

Well, not mine :). It’s on Amazon now, with book preview, which means it must be real. However, the release date listed by Amazon is December 31. If you were planning on lining up late, don’t worry: I think that means it will come out at midnight before New Year’s Eve, so you don’t have to cancel any New Year’s Eve plans.

Drat, I thought it would be out sooner. It’s certainly not obsolete (yet), but the facts on the ground (as Rumsfeld might say) are moving quickly. I think it may come out earlier in the UK (?).


Redirect Loop in WordPress

In the “for what it’s worth” department…

I’ve been periodically getting a “Redirect Loop” error in WordPress, and I wasn’t sure of the source. I tried the things you are supposed to do (deleting cookies, especially), with no luck. So I went in via FTP and temporarily swapped all my plugins out of the plugins directory. That let me log in, which suggests that one of the many plugins is the culprit. No, I don’t know which one yet. I’ll add them back one by one to track it down.

For now, if you are having a frustrating Redirect Loop problem, look to the plugins. That is all.


[the making of, pt. 6] Are you experienced?

This is the sixth in a series of posts about the piece of research I am doing on Digg. You can read it from the beginning if you are interested. In the last section I showed a correlation between how much of a response people got from their comments and their propensity to contribute future comments to the community. In this section, I ask whether we can observe some form of “learning” or “training” over time among Digg commenters. Do later comments garner more Diggs, either because individual commenters learn the community’s tastes, or because the less successful ones drop out (attrition)?

Are later comments better Dugg?

You will recall that we have numbered the comments for each user in our sample from the earliest to the most recent. If people are learning to become more acceptable to the community, we should see a significant difference in responses (Diggs and replies) between people’s first posts and their 25th posts.

Loading all the data into R, I find a fairly strong correlation between post number and upward diggs (.28), along with weaker ones with downward diggs (.11) and replies (.08). I’d like to show this as a boxplot, so you can clearly see the growing abilities of users, but R is giving me problems. The issue is simple enough: although I can “turn off” the plotting of outliers outside the boxes, R still scales the chart to include them. Since one of the comments received over 1,500 diggs up, my boxes (which have median values in the twos and threes) sit at the bottom of the graph as little lines. After a little digging in the help file, I figure out how to assign limits to the y axis (with ylim=c(0,10)), and I generate the figure seen to the right.
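For the curious, the R in question looks roughly like this. It’s a minimal sketch: the file and column names (digg_comments.csv, post_number, diggs_up, and so on) are my stand-ins, not the actual ones.

# A sketch, assuming the comment data lives in a CSV with (hypothetical)
# columns post_number, diggs_up, diggs_down, and replies.
comments <- read.csv("digg_comments.csv")

# The correlations reported above (roughly .28, .11, and .08):
cor(comments$post_number, comments$diggs_up)
cor(comments$post_number, comments$diggs_down)
cor(comments$post_number, comments$replies)

# The boxplot: outline = FALSE suppresses the outlier points, and ylim
# caps the y axis so the boxes (medians in the twos and threes) stay
# visible despite the comment with 1,500+ diggs.
boxplot(diggs_up ~ post_number, data = comments,
        outline = FALSE, ylim = c(0, 10),
        xlab = "Comment number", ylab = "Diggs up")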

But this raises the question of what creates the increase. As with failing public high schools, some of the rise in positive marks might just be because the less capable Diggers are dropping out. We need to figure out whether this is skewing our results.

Dropping out the Dropouts

In order to filter out the dropouts, I turn to… nope, not Python this time. I could, but it’s just as easy to sort all the comments in Excel by comment number, so that all the 30th comments are in one place on the worksheet. I then copy and paste those 812 usernames to a new sheet. In the last column of the main sheet, I write a function that says: if the username is on that list, and if this comment’s number is below 30, print a 1 in this column; otherwise, leave it blank. If you are curious what that function looks like precisely, it’s this:

=IF(I178345<30,IF(ISNA(VLOOKUP(D178345,have30!$A$1:$B$812,1,FALSE)),"",1),"")

I can now sort the list by this new column, and I have the first 30 comments from every user who has made at least 30 comments in one place. I pull these into R and rerun the correlations. It turns out that, no surprise, they are reduced. The correlations to buries and replies are near zero, and the correlation to diggs drops to 0.19.
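The same filter can be done without the Excel detour. Here is a sketch in R, using the stand-in column names from above plus a hypothetical username column, and assuming comments are numbered from 1:

# Keep the first 30 comments from each user with at least 30 comments.
counts  <- table(comments$username)
have30  <- names(counts)[counts >= 30]
first30 <- subset(comments, username %in% have30 & post_number <= 30)

# Rerun the correlations on the filtered set (diggs come out around
# 0.19; buries and replies near zero):
cor(first30$post_number, first30$diggs_up)
cor(first30$post_number, first30$diggs_down)
cor(first30$post_number, first30$replies)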

I’m actually pretty happy with a 0.19 correlation; it means that there is something going on. But I’m curious what reviewers will think. The idea of a “strong” correlation is a bit arbitrary: it depends on what you are doing. If I designed an intervention that, over a six-month period, correlated at -0.19 with body weight, or crime rates, or whatever, it would be really important. The open question is whether there are other stable factors that can explain this, or whether the rest of the variability is due to, say, the fact that humans are not ants and tend to do unpredictable stuff now and again. Obviously, this cries out for some form of factor analysis, but I’m not sure how many of the other factors are measurable, or what they might be.

Hidden in these numbers, I suspected, were trolls: experienced users seeking out the dark side, learning to be more and more execrable over their first 30 comments. I wanted to get at the average scores of these folks, so I used the “subtotal” function in Excel (which can give you “subaverages” as well) and did some copying, pasting, and sorting to identify the extreme ends. The average of these per-user averages was a score of about 3. The most “successful” poster managed an average score of over 33. She had started out with a bumpy ride: her first 24 posts had an average score below zero. But she cracked the code by the 25th comment, and received scores consistently in the hundreds for the last five comments in this chunk of data.
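The subaverages are also a one-liner in R with aggregate(). A sketch, continuing from the filtered set above, and assuming the net score is simply diggs up minus diggs down:

# Per-user average scores over the first 30 comments.
first30$score <- first30$diggs_up - first30$diggs_down
user_means <- aggregate(score ~ username, data = first30, FUN = mean)

mean(user_means$score)                     # the "average average", about 3
user_means[which.max(user_means$score), ]  # the star: average over 33
user_means[which.min(user_means$score), ]  # the troll: average of -11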

On the other end was someone with an average score of -11. Among the first thirty entries, only one rose above zero, and the rest got progressively worse ratings, employing a litany of racist and sexist slurs along with attacks on various Digg sacred cows. It may be that she was simply after the negative attention, paying no mind to its quantification as a Digg score, but it’s clear the objective was not to fit in.

Enough with the numbers!

I wanted to balance out the questions of timing and learning with at least an initial look at content. I always like to use mixed methods, even though it tends to make things harder to publish. At some point I really need to learn the lesson of Least Publishable Units, and split my work into multiple papers, but I’m not disciplined enough to do that yet. So, in the next sections I take on the question of what kinds of content seem to affect ratings.

[Update: I pretty much ran out of steam on documenting this. The Dr. Seuss-inspired presentation for the Internet Research conference is here, and a version of the article was eventually published in Information, Communication & Society.]


[the making of, pt. 5] Rat in a cage

[This is the fifth in a series of posts about a piece of research I am doing on Digg. If you like, you can start at the beginning. At this point, we have the data, and are manipulating it in various ways to try to show some relationships.]

Extracting some variables

One of the things we want to see is whether diggs (either up or down) affect the amount of time it takes a user to comment again. We can also look at the case of the last comment: does a bad rating make a user more likely to quit? The two questions are related.

We have a hint from an earlier study uncovered in the lit review: Lampe & Johnson suggest that on Slashdot, any rating (down or up) was likely to encourage participation, at least for newcomers, while persistent bad ratings seemed to drum people out. So we want to see whether there is a relationship between the ratings for a comment and the latency before the next comment is posted.

We have some of the variables we need: the number of up and down Diggs. Although we don’t have the total rating, that’s easy enough to derive; heck, we can do it in our stats program or in Excel if we want to. We also have replies, and because we have them, we’re going to look at them. If it turns out there is a relationship worth reporting there, we can go back and include it in the report. That wasn’t really part of the initial plan, but the data is there, and a relationship seems plausible, so we should check it.

The main thing that isn’t there is the amount of time between the post under consideration and the subsequent post. As the last section suggested, we have a lot of people who posted nothing, or had only a single post, but for those who had multiple posts, we need to (surprise!) write a script that will run through and figure out the latencies.

This is actually the most complex script so far. It needs to find all the comments posted by a given user and put them in chronological order. It then needs to find the difference in time between each consecutive pair of comments. Digg provides the time of each comment in Unix format, that is, seconds since January 1, 1970, so we can compute the difference in seconds. Obviously, if a post has no post following it, we can’t compute a latency; in that case, we fill the slot with a -1 to indicate that this was a “final comment” for the user. That may mean the user has quit, or simply that she hasn’t posted again before the end of our collection period.

Also, for reasons that will become clear later on, we store an indicator of the order of the comments; it will make it easier to find the first, fifth, or twelfth post when we want to later. The core logic is sketched below.
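The actual script runs over the raw collection, but the heart of it, sketched here in R for compactness (and assuming a comments data frame with hypothetical username and Unix-seconds timestamp columns), is short:

# Sort by user, then by time, so each user's comments are in order.
comments <- comments[order(comments$username, comments$timestamp), ]

# For each user: number the comments, and compute the gap in seconds
# to the next comment. A -1 marks a "final comment" in our window.
by_user  <- split(comments, comments$username)
comments <- do.call(rbind, lapply(by_user, function(d) {
  d$post_number <- seq_len(nrow(d))
  d$latency     <- c(diff(d$timestamp), -1)
  d
}))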

Is there a correlation worth looking at?

Our first step here is to look for an obvious correlation in a scatterplot of some of the variables. Why bother with a scatterplot? It would be convenient to use Pearson’s r to see whether there is a significant correlation, but it assumes normally distributed variables. It was pretty obvious from the outset that these distributions were nothing like normal (rankings, counts, and the like), so I knew I would be using non-parametric tests (Mann-Whitney-Wilcoxon and Spearman’s ρ), but a scatterplot is still helpful for getting a handle on the data.

I didn’t want to look at all the cases: some of the posts came mere seconds after one another (something strange there), so I tossed the data into Excel, sorted by latency, and chopped off anything shorter than 5 minutes. From there, I could copy and paste items into R as needed to check for relationships.

I was disappointed to find that the correlations between latency and diggs were fairly weak. It turns out that if you get more diggs, you may return to post a little faster, but not much. When you compare posts that had some diggs (including being dugg down!) with those that had none, however, there is a significant gap. The standard deviation is also extremely high, but with over a hundred thousand cases, we can still say with confidence that there is a difference in averages.
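For reference, both tests are one-liners in R. A sketch, using the latency and stand-in columns from the script above:

# Drop final comments and the sub-5-minute oddities (300 seconds).
active <- subset(comments, latency >= 300)

# Spearman's rho for latency against up-diggs (weak, as noted):
cor.test(active$latency, active$diggs_up, method = "spearman")

# MWW: latency for posts that got any diggs at all vs. none.
fed <- (active$diggs_up + active$diggs_down) > 0
wilcox.test(active$latency[fed], active$latency[!fed])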

I also took a look at the comments that were “mid-stream” in a user’s Digg career, as compared to those that had no following comment. Now, a trailing comment might just mean that (like me!) the user has taken a break from Digg for a while, not that they have quit. But these also include those who posted once or twice and gave up. Here, the differences were even more stark: any sort of feedback increases the likelihood of people coming back.
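That comparison can be eyeballed with a quick cross-tabulation; again a sketch with my stand-in column names:

# Was this a user's last comment in our window?
final <- comments$latency == -1
# Did it draw any feedback (diggs either way, or a reply)?
fed <- with(comments, diggs_up + diggs_down + replies > 0)

# Share of final vs. mid-stream comments, by feedback status:
prop.table(table(final = final, feedback = fed), margin = 2)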

Note that I’ve just committed the cardinal sin of correlational research, sliding toward causal language, and it’s an easy sin to commit. It may not be that lack of feedback causes attrition, but rather that those who aren’t very invested in Digg don’t produce content that gets a lot of feedback. Either way, we can say with some certainty that low Diggs tend to go along with less frequent future participation by the commenter.

Coming up

In the next segment, I try to see whether experience actually plays a role in how many Diggs you get.
