[the making of, pt. 5] Rat in a cage

[This is the fifth in a series of posts about a piece of research I am doing on Digg. If you like, you can start at the beginning. At this point, we have the data, and are manipulating it in various ways to try to show some relationships.]

Extracting some variables

One of the things we want to see is whether diggs (either up or down) affect the amount of time it takes to comment again. We can also look at the case of the last comment: does a bad rating make a user more likely to quit? The two questions are related.

We have a hint from an earlier study uncovered in the lit review. Lampe & Johnson suggest that on Slashdot, any rating (down or up) was likely to encourage participation, at least for newcomers. Persistent bad ratings seemed to drum people out. So, we want to see whether there is a relationship between the ratings for a comment and the latency before that user’s next comment is posted.

We have some of the variables we need. We have the number of up and down Diggs. Although we don’t have the total rating, that’s easy enough to derive: heck, we can do it in our stats program or Excel if we want to. We also have the number of replies, and because we have it, we’re going to look at it. If it turns out there is a relationship worth reporting there, we can go back and include it in the report. It wasn’t really part of the initial plan, but it’s there, and a relationship seems plausible, so we should check it.

The main thing that isn’t there is the amount of time between the post under consideration and the same user’s subsequent post. As the last section suggested, we have a lot of people who posted nothing, or only had a single post, but for those who had multiple posts, we need to (surprise!) write a script that will run through and figure out latencies.

This is actually the most complex script so far. It needs to find all the comments posted by a given user, then put them in chronological order. It then needs to find the difference in time between each pair of comments. In each case, Digg provides the time of the comment in Unix format–that is, seconds since January 1, 1970. So, we can generate the difference in seconds. Obviously, if we don’t have a post following a given post, we can’t find such a latency. In that case, we fill that slot with a -1 to indicate that this was a “final comment” for the user. That may mean the user has quit, or simply that she hasn’t posted a comment again before our period of collection.
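To make those steps concrete, here is a minimal Python sketch of that kind of latency script. It is not the script actually used for this project: the function name and the dictionary keys ('user', 'id', 'date') are assumptions made up for illustration, and it expects the comments to already be loaded into memory as a list.

```python
from collections import defaultdict

FINAL_COMMENT = -1  # marker for "no later comment by this user"


def add_latencies(comments):
    """Return one row per comment with 'order' and 'latency' added.

    `comments` is a list of dicts with the hypothetical keys
    'user', 'id' and 'date' (Unix time, in seconds).
    """
    # Group every comment under the user who posted it.
    by_user = defaultdict(list)
    for comment in comments:
        by_user[comment["user"]].append(comment)

    rows = []
    for posts in by_user.values():
        # Put each user's comments in chronological order.
        posts.sort(key=lambda c: c["date"])
        for i, post in enumerate(posts):
            if i + 1 < len(posts):
                # Seconds until the same user's next comment.
                latency = posts[i + 1]["date"] - post["date"]
            else:
                latency = FINAL_COMMENT
            # 'order' is 1 for a user's first comment, 2 for the second, ...
            rows.append({**post, "order": i + 1, "latency": latency})
    return rows
```

Using a sentinel like -1 keeps the output as a flat table that Excel or R can read, at the cost of having to remember to filter it out before computing any averages.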

Also, for reasons that will become clear later on, we store an indicator of the order of the comments. It will make it easier to find the first, fifth, or twelfth post when we want to later on.

Is there a correlation worth looking at?

Our first step here is to look to see if there is an obvious correlation in a scatterplot of some of the variables. Why bother with a scatterplot? It would be convenient to make use of Pearson’s r to see whether there is a significant correlation, but it assumes a normal distribution of the variables. It was pretty obvious from the outset that these variables were not normally distributed (ranking posts, etc.), so I knew I would be using non-parametric tests (Mann-Whitney-Wilcoxon and Spearman’s ρ), but a scatterplot is still a helpful way to get a handle on the data.
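For anyone wondering what Spearman’s ρ actually does differently: it is Pearson’s r computed on the ranks of the values rather than the values themselves, which is why it only cares whether the relationship is monotonic. The sketch below is a plain-Python illustration of that idea, with tied values sharing an average rank. The function names are my own for this example; in practice a stats package does this for you, and this version does not compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of xs and ys."""
    return pearson_r(average_ranks(xs), average_ranks(ys))
```

A perfectly monotonic but curved relationship (say, y = x squared for positive x) gets a ρ of 1 even though Pearson’s r on the raw values would come out lower.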

I didn’t want to look at all the cases: some of the posts were mere seconds after one another (something strange there), so I tossed the data into Excel, sorted by latency, and chopped off anything with a latency shorter than 5 minutes. From there, I could just copy and paste items into R as I needed to figure out whether there were some relationships.
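I did this step by hand in Excel, but the same cut is easy to express in code. Here is a rough Python equivalent, assuming each row is a dict with a 'latency' field in seconds (a made-up layout for this example) and that final comments are marked with -1. Note that the cutoff also throws out those -1 rows, which is what you want for a latency correlation, since they have no latency to correlate.

```python
MIN_LATENCY = 5 * 60  # five minutes, expressed in seconds


def keep_for_analysis(rows, min_latency=MIN_LATENCY):
    """Keep rows whose latency is at least `min_latency`, sorted by latency.

    Rows marked -1 (no following comment) fall below the cutoff and are
    dropped along with the rapid-fire follow-ups.
    """
    kept = [row for row in rows if row["latency"] >= min_latency]
    return sorted(kept, key=lambda row: row["latency"])
```

The final-comment rows still matter for the quit-or-stay comparison, so it is worth keeping the unfiltered table around rather than overwriting it.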

I was disappointed to find that the correlations between latency and diggs were fairly weak. It turns out that if you get more diggs, you may return to post a little bit faster, but not much. When you look only at a comparison of posts that had some diggs (including being dugg down!) with those that had none, there is a fairly significant gap. The standard deviation is also extremely high, but with over a hundred thousand cases, we can still say with confidence that there is a difference in averages.

I also took a look at the comments that were “mid stream” in a user’s Digg career, as compared to those that had no following comment. Now, a trailing comment might just mean that the user has (like me!) taken a break from Digg for a while, not that they have quit. But the trailing comments also include those from people who posted once or twice and gave up. Here, the differences were even more stark: any sort of feedback increases the likelihood of people coming back.

Note that I’ve just committed the cardinal sin of correlation above, treating it as causation, and it’s an easy one to commit. It may not be that a lack of feedback causes attrition, but rather that those who aren’t very into Digg don’t produce content that gets a lot of feedback. In either case, we can say with some certainty that fewer Diggs tend to go along with less frequent participation by the commenter in the future.

Coming up

In the next segment, I try to see whether experience actually plays a role in how many Diggs you get.
