Digg – A Thaumaturgical Compendium
https://alex.halavais.net
Things that interest me.

Copenhagen presentation
https://alex.halavais.net/copenhagen-presentation/
Sun, 12 Oct 2008 02:52:12 +0000

Off for Denmark in a few hours. Here are my slides for my Tuesday morning presentation. I didn’t have time to do my now standard video-based presentation style. Much more basic this time around…

Trying Google Presenter, too.

[the making of, pt. 6] Are you experienced?
https://alex.halavais.net/the-making-of-pt-6-are-you-experienced/
Thu, 02 Oct 2008 13:54:21 +0000

This is the sixth in a series of posts about the piece of research I am doing on Digg. You can read it from the beginning if you are interested. In the last section I showed a correlation between how much of a response people got from their comments and their propensity to contribute future comments to the community. In this section, I ask whether we can observe some form of “learning” or “training” over time among Digg commenters. Do later comments garner more Diggs, and if so, is that because individuals learn what the community rewards, or because less successful commenters drop out?

Are later comments better Dugg?

You will recall that we have numbered the comments for each user in our sample from the earliest to the most recent. If people are learning to become more acceptable to the community, we should see a significant difference in responses (Diggs and replies) between people’s first posts and their 25th posts.

Loading all the data into R, I find a fairly strong correlation between post number and upward diggs (.28), along with weaker correlations with downward diggs (.11) and replies (.08). I’d like to show this as a boxplot, so you can clearly see the growing abilities of users, but R is giving me problems. The issue is simple enough: although I can “turn off” the plotting of outliers outside the boxes, R still scales the chart to include them. Since one comment received over 1,500 diggs up, my boxes (which have median values in the twos and threes) end up sitting at the bottom of the graph as little lines. After a little digging in the help file, I figure out how to set limits on the y axis (with ylim=c(0,10)), and I generate the figure seen to the right.
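
For anyone following along in Python rather than R, roughly the same trick can be sketched with matplotlib (toy numbers below stand in for the real sample, and the grouping by comment number is assumed to have been done already):

import matplotlib.pyplot as plt

# 'groups' holds one list of up-digg counts per comment number (toy data).
groups = [[0, 1, 2, 3, 1500], [1, 2, 3, 4, 5], [2, 3, 4, 6, 900]]
plt.boxplot(groups, showfliers=False)   # suppress the outlier points...
plt.ylim(0, 10)                         # ...and clamp the y axis regardless
plt.xlabel('comment number')
plt.ylabel('diggs up')
plt.show()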

But this raises the question of what creates the increase. Like a failing public high school whose average scores rise as weaker students drop out, some of the rise in positive marks might simply reflect less capable Diggers leaving. We need to figure out whether this is messing with our results.

Dropping out the Dropouts

In order to filter out the dropouts, I turn to… nope, not Python this time. I could, but it’s just as easy to sort all the comments in Excel by order, so that all the 30th comments are in one place on the worksheet. I then copy and paste these 812 usernames to a new sheet. In the last column of the main sheet, I write a function that says: if the username is on that list, and if this comment falls within the user’s first 30, put a 1 in this column; otherwise, leave it blank. If you are curious what that function looks like precisely, it’s this:

=IF(I178345<30,IF(ISNA(VLOOKUP(D178345,have30!$A$1:$B$812,1,FALSE)),"",1),"")

I can now sort the list by this new column, and I have the first 30 comments from every user who has made at least 30 comments, all in one place. I pull these into R and rerun the correlations. No surprise: they are reduced. The correlations with buries and replies are near zero, and the correlation with diggs drops to 0.19.
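
If you would rather script that filtering step than do it in Excel, a minimal Python sketch might look like this (the file name and column names are assumptions, not the actual ones):

import csv
from collections import Counter

# Keep only the first 30 comments from users who made at least 30 comments.
# 'comments.csv', 'username', and 'comment_number' are assumed names.
with open('comments.csv', newline='') as f:
    rows = list(csv.DictReader(f))

counts = Counter(row['username'] for row in rows)
kept = [row for row in rows
        if counts[row['username']] >= 30 and int(row['comment_number']) <= 30]

with open('first30.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(kept)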

I’m actually pretty happy with a 0.19 correlation. It means that there is something going on. But I’m curious as to what reviewers will think. The idea of a strong correlation is a bit arbitrary: it depends on what you are doing. If I designed a program that, over a six month period, correlated at -0.19 with body weight, or crime rates, or whatever, it would be really important. The open question is whether there are other stable factors that can explain this, or if the rest of the variability is due to, say, the fact that humans are not ants and tend to do unpredictable stuff now and again. Obviously, this cries out for some form of factor analysis, but I’m not sure how many of the other factors are measurable, or what they might be.

Hidden in these numbers, I suspected, were trolls: experienced users who were seeking out the dark side, learning to be more and more execrable over their first 30 comments. I wanted to get at the average scores of these folks, so I used the “subtotal” function in Excel (which can give you “subaverages” as well), and did some copying, pasting, and sorting to identify the extreme ends. The average of these user averages was a score of about 3. The most “successful” poster managed an average score of over 33. She started out with a bumpy ride: her first 24 posts had an average score below zero. But she cracked the code by the 25th comment, and received scores consistently in the hundreds for the last five comments in this chunk of data.

At the other end was someone with an average score of -11. Among the first thirty entries, only one rose above zero, and the rest received progressively worse ratings, employing a litany of racist and sexist slurs, along with attacks on various other sacred cows of Digg. It may be that she was simply after the negative attention, and paid no mind to its quantification in the form of a Digg score, but it’s clear that the objective was not to fit in.
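
For completeness, roughly the same per-user averaging can be sketched in Python as well, continuing from the (assumed) filtered file above:

import csv
from collections import defaultdict

# Average each user's comment score (diggs up minus diggs down) and report
# the best and worst averages. File and column names are assumptions.
scores = defaultdict(list)
with open('first30.csv', newline='') as f:
    for row in csv.DictReader(f):
        scores[row['username']].append(int(row['diggs_up']) - int(row['diggs_down']))

averages = {user: sum(v) / len(v) for user, v in scores.items()}
best = max(averages, key=averages.get)
worst = min(averages, key=averages.get)
print('highest average score:', best, round(averages[best], 1))
print('lowest average score:', worst, round(averages[worst], 1))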

Enough with the numbers!

I wanted to balance out the questions of timing and learning with at least an initial look at content. I always like to use mixed methods, even though it tends to make things harder to publish. At some point I really need to learn the lesson of Least Publishable Units, and split my work into multiple papers, but I’m not disciplined enough to do that yet. So, in the next sections I take on the question of what kinds of content seem to affect ratings.

[Update: I pretty much ran out of steam on documenting this. The Dr. Seuss-inspired presentation for the Internet Research conference is here, and a version of the article was eventually published in Information, Communication & Society.]

[the making of, pt. 5] Rat in a cage
https://alex.halavais.net/the-making-of-pt-5-rat-in-a-cage/
Mon, 29 Sep 2008 17:54:33 +0000

[This is the fifth in a series of posts about a piece of research I am doing on Digg. If you like, you can start at the beginning. At this point, we have the data, and are manipulating it in various ways to try to show some relationships.]

Extracting some variables

One of the things we want to see is whether diggs (either up or down) affect the amount of time it takes a user to comment again. We can also look at the case of the last comment: does a bad rating make a commenter more likely to quit? The two questions are related.

We have a hint from an earlier study uncovered in the lit review. Lampe & Johnston suggest that on Slashdot, any rating (down or up) was likely to encourage participation, at least for newcomers, while persistent bad ratings seemed to drum people out. So we want to see whether there is a relationship between the ratings a comment receives and the latency before the next comment is posted.

We have some of the variables we need. We have the number of up and down Diggs. Although we don’t have the total rating, that’s easy enough to derive–heck, we can do it in our stats program or in Excel if we want to. We also have the number of replies, and because we have it, we’re going to look at it. If it turns out there is a relationship worth reporting there, we can go back and include it in the report. It wasn’t really part of the initial plan, but it’s there, and a relationship seems plausible, so we should check it.

The main thing that isn’t there is the amount of time between the post under consideration and the subsequent post. As the last section suggested, we have a lot of people who posted nothing, or only had a single post, but for those who had multiple posts, we need to (surprise!) write a script that will run through and figure out latencies.

This is actually the most complex script so far. It needs to find all the comments posted by a given user and put them in chronological order. It then needs to find the difference in time between each pair of consecutive comments. In each case, Digg provides the time of the comment in Unix format–that is, seconds since January 1, 1970–so we can compute the difference in seconds. Obviously, if a post has no following post, we can’t compute a latency for it. In that case, we fill that slot with a -1 to indicate that this was a “final comment” for the user. That may mean the user has quit, or simply that she hadn’t posted again by the end of our collection period.

Also, for reasons that will become clear later on, we store an indicator of the order of the comments. It will make it easier to find the first, fifth, or twelfth post when we want to later on.
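
In outline, the script does something like this minimal sketch, assuming the comments have already been loaded as (username, Unix timestamp) pairs:

from collections import defaultdict

def latencies(comments):
    # Group comment timestamps by user.
    by_user = defaultdict(list)
    for username, ts in comments:
        by_user[username].append(ts)

    results = []   # (username, order, timestamp, latency_in_seconds)
    for username, times in by_user.items():
        times.sort()                          # chronological order
        for order, ts in enumerate(times, start=1):
            if order < len(times):
                latency = times[order] - ts   # seconds until the next comment
            else:
                latency = -1                  # flag for the user's final comment
            results.append((username, order, ts, latency))
    return results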

Is there a correlation worth looking at?

Our first step here is to see whether there is an obvious correlation in a scatterplot of some of the variables. Why bother with a scatterplot? It would be convenient to use Pearson’s r to see whether there is a significant correlation, but it assumes the variables are normally distributed. It was pretty obvious from the outset that these data were not (skewed counts, rank-like ratings), so I knew I would be using non-parametric tests (the Mann–Whitney–Wilcoxon test and Spearman’s ρ), but a scatterplot still helps in getting a handle on the data.

I didn’t want to look at all the cases: some of the posts were mere seconds after one another (something strange there), so I tossed it into Excel, sorted by the latency, and chopped off anything shorter than 5 minutes. From there, I could just copy and paste items into R as I needed to figure out whether there were some relationships.

I was disappointed to find that the correlations between latency and diggs were fairly weak. It turns out that if you get more diggs, you may return to post a little bit faster, but not much. When you look only at a comparison of posts that had some diggs (including being dugg down!) with those that had none, there is a fairly significant gap. The standard deviation is also extremely high, but with over a hundred thousand cases, we can still say with confidence that there is a difference in averages.
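
I ran these tests in R, but roughly the same pair of tests can be sketched in Python with SciPy (the toy numbers here just stand in for the real latency and digg columns):

from scipy import stats

# Toy stand-ins for the real columns: diggs received on a comment and the
# number of seconds until that commenter's next post.
diggs   = [0, 1, 3, 0, 7, 2, 0, 5, 1, 0]
latency = [90000, 40000, 12000, 120000, 8000, 30000, 150000, 9000, 50000, 110000]

rho, p = stats.spearmanr(diggs, latency)
print('Spearman rho:', round(rho, 2), 'p =', round(p, 3))

# Compare latencies for comments that got any feedback against those that got none.
with_feedback = [l for l, d in zip(latency, diggs) if d > 0]
no_feedback   = [l for l, d in zip(latency, diggs) if d == 0]
u, p = stats.mannwhitneyu(with_feedback, no_feedback, alternative='two-sided')
print('Mann-Whitney U:', u, 'p =', round(p, 3))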

I also took a look at the comments that were “mid-stream” in a user’s Digg career, as compared with those that had no following comment. Now, a trailing comment might just mean that (like me!) the user has taken a break from Digg for a while, not that they have quit. But the trailing comments also include those from people who posted once or twice and gave up. Here, the differences were even starker: any sort of feedback increases the likelihood of people coming back.

Note that I’ve just committed the cardinal sin of correlation–implying causation–and it’s an easy sin to commit. It may not be that lack of feedback causes attrition, but rather that those who aren’t very into Digg don’t produce content that gets a lot of feedback. Either way, we can say with some certainty that low Diggs tend to go along with less frequent participation by the commenter in the future.

Coming up

In the next segment, I try to see whether experience actually plays a role in how many Diggs you get.

[the making of, pt. 4] Basic descriptions of the sample
https://alex.halavais.net/the-making-of-pt-4-basic-descriptions-of-the-sample/
Fri, 26 Sep 2008 13:56:29 +0000

This is the fourth in a series of posts about a paper I am writing, breaking down the process old-school. It started here. So, in part 3, I talked about how I got the sample of the users (and waved my hands a bit about the sample of the comments). Now, I want to tell my audience (and know myself) the basic structure of the sample.

Counting it up

I can say, for example, how many accounts I collected (30,000), the oldest account in the sample (December of 2004), and the newest (cut off at May of 2008). I also want to say something about how many of these users post comments, and how many comments I have in total.

The latter is pretty easy: I dump my comma-delimited file of comments into a plain text editor. I use Notepad2 for this sort of thing, because it has line numbering (making my job easier), and doesn’t–like the original Notepad–crash a Windows system when you try to open very large files. In total, 197,658 comments.
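
(If you would rather not open a 200,000-line file in an editor at all, a couple of lines of Python will do the same count; the file name here is an assumption.)

# Count the lines (comments) in the file without opening it in an editor.
with open('comments.csv') as f:
    print(sum(1 for line in f))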

Distribution

That’s a lot of comments, and works out to more than six per sampled user on average. But we know it’s unlikely that many people post the “average” number of comments, or even that the number is distributed normally around that average. Far more likely is that a large number of people post never or infrequently, while a handful of enthusiasts post every two minutes or so. What we need to do is count up the comments by user.

So we turn to Python again, and write a quick script that goes through the 197K comments and counts how many each user has made. In practice, the program doesn’t find all 30,000 users in our sample, because 23,532 of them have not posted a single comment. The result is a comma-delimited file with each user name and number of comments. Now we can construct a histogram.
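
A minimal sketch of that counting step, assuming the comments sit in a CSV with a username column, looks something like this:

import csv
from collections import Counter

# Tally comments per user and write out username,count pairs.
# 'comments.csv' and the 'username' column are assumed names.
with open('comments.csv', newline='') as f:
    counts = Counter(row['username'] for row in csv.DictReader(f))

with open('comment_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for username, n in counts.most_common():
        writer.writerow([username, n])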

Histogram

I am a big fan of Excel, and we could use it to create the histogram, but I always seem to spend about 15 minutes figuring out how to do histograms in Excel, relearning it each time. The obvious choice is SPSS, but for a change of pace, I’m going to use a free piece of statistical software called R.

The reason is simple enough. A quick look at a regular histogram shows that this is a heavily skewed, Pareto-like (power-law) distribution. When I plot it as a regular histogram, it comes out as two lines along the axes and a tiny curve at the origin. One person actually made 6,598 comments, and I had to check the site to make sure there hadn’t been an error. Another posted over 4,000 comments.

So, what we need is a log-log histogram. Although I’m sure there is a function that will do this for me neatly (and I have to admit ignorance when it comes to doing this in SPSS, though I suspect it’s just a matter of checking a box), I’m once again going to turn to Python to write a script that comes up with frequencies (i.e., how many people posted once, twice, … n times). I could “bin” these frequencies and come up with something that looks like a regular histogram, but since folks are not as used to seeing log-log bar charts, I decided to do it without the bins. The resulting file is just a number on each line: the first line is the number of users with one comment, the next line the number of users with two comments, and so on. I drop this file into Notepad2 to take a look, and (CTRL-C) copy all the data.
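
The frequency script itself amounts to something like the following sketch, continuing from the (assumed) counts file above:

import csv
from collections import Counter

# How many users posted once, twice, ... n times, written one number per line
# so line n holds the number of users with exactly n comments.
with open('comment_counts.csv', newline='') as f:
    per_user = [int(row[1]) for row in csv.reader(f)]

freq = Counter(per_user)
with open('frequencies.txt', 'w') as f:
    for n in range(1, max(freq) + 1):
        f.write(str(freq.get(n, 0)) + '\n')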

I open up R, and first execute this command:

x <- type.convert(readClipboard())

This loads all of the data I just copied into a “vector” called x. If you are unfamiliar with the format of R commands, note that <- is the assignment symbol: it says put the stuff on the right into the box on the left. The readClipboard function–shockingly–reads whatever is on the Windows clipboard. The type.convert function converts the strings into integers, since the clipboard assumes whatever you are copying is a string (or character) rather than a number. Now we have all this stuff in the vector x.

Next, I issue the following command:

plot(x, log="xy", xlab="log(number of comments)", ylab="log(number of users)")

which produces the plot shown to the right. It should be pretty clear what each of the options there does, creating a log-log plot of the vector x, with labels for each axis.

Next: Hypothesis testing!

Now we have some basic descriptions of the data, enough to give the reader a feel for what we are working with. Time to rearrange the data a few more times and take measurements that will help us answer questions about the relationship of feedback scores to posting behavior, in part 5.

[the making of, pt. 3] Creating a sample
https://alex.halavais.net/the-making-of-pt-a/
Mon, 22 Sep 2008 18:02:38 +0000

(This is the third in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Having put together the bones of a literature review in part 2, I now need to collect some data. My initial aim was a round number: 2,000 users, selected at random. There isn’t any particular reason for that number, other than it is small enough to handle, but large enough to lend a bit of power when it comes to the stats. It turned out that I actually ended up having to sample far more (30K), to get enough commenters, but didn’t know that until after I started. Research is messy.

Programming… shudder

Broadly, I wanted to collect a sample of individual users’ histories on Digg. This is made much easier because Digg, like many web services today, provides an API or “application programming interface,” a way of tapping into the databases that drive the system. The good news is that this makes it much easier to access the data if you know a little bit about programming. So, the unfortunate first step of this is learning to program, just a little, or engaging in enough “social engineering” to get someone who can program to help you. If you’ve done anything advanced in SPSS, SAS, Excel, or the like, you probably have done some programming without knowing it, but even if you’ve never done any programming before, taking this one step will help you way beyond this project. You don’t even have to be a good programmer (I’m not); you just need to be able to get the computer to do what you want it to. Really, programming is just a matter of being very explicit about what you want to do, and then putting that in a standardized format (a computer language).

To that end, the first step I took was to download a Python installer (for free) here. One of the nice things about Python is that it is easy to learn, but powerful enough for people to do just about anything they want to. For what follows, I assume a very basic understanding of Python. If you already have some programming experience, see Mark Pilgrim’s Dive Into Python. There are lots of intros out there if you are just getting started: peruse your local bookstore.

Generally, when a service offers an API, someone will put together a tool (a “wrapper”) that makes it easy to access in Python and other popular programming languages. Derek van Vliet has written such a package for the Digg API, and so I downloaded it (PyDigg), again, for free, and put it in my Python directory.

A sample

I needed a sample of 2,000 user names. Like many systems, Digg appears to associate an incremental user ID with the user name. So if you look at the source of my user page, you will find a snippet of code that reads:

<input type="hidden" id="userid" value="242530" />

suggesting that I am the 242,530th person to sign up for a Digg account. This is great: it means that if there were a way to get userpages by number, I could just keep feeding that in and end up with a sample. Unfortunately, I’m not sure that is possible (or at least I didn’t immediately see a way to do it through the API).

We also need to know how many users there are if we are going to sample them. Luckily, that’s pretty easy to find out here. As of the time of this posting, that number is just shy of three million. So, if we take random users from that list, we should end up with a workable sample. In practice, we may want to limit the sample to particular registration dates, or users that have a certain minimum of activity, but for now, we’ll just try for a straight sample of all registered users.

The script

The PyDigg site gives a short sample script showing how it works, and a glance through its code provides a good indication of the sorts of things it can provide for you. Here is the script I used to assemble a sample of 2,000 users, with some comments as we go:

from digg import *
from random import *
from time import *

We start by opening up three libraries that are needed for our task. The Digg library, obviously, gives us access to the Digg API, “random” gives us a function that produces random numbers, and we’ll use a function from “time” to slow down our requests to Digg, and not hog the resource.

TOTALUSERS = 2902633
APPKEY = 'http%3A%2F%2Falex.halavais.net'
SAMPLESIZE = 2000
PAUSE = 60
FILENAME = 'sampleusers.txt'

We start out with some constants. We could automatically get the total number of users via the API, but really, since this is a one-off script, that’s not necessary: it is easy enough to do it as a human, looking at the XML returned. The APPKEY is an application key that tells Digg who you are. Usually, it is the URL of a page for the program that is running. In this case, it is just a link to my blog, assuming that Digg can track me down that way. Please don’t use that for your own program :). SAMPLESIZE is how many users I want to collect, and PAUSE is the number of seconds I want the script to idle between calls. Digg doesn’t give a hard number here, just asks you to be polite. A full minute may be overkill, but since you are going to wait anyway, why not just let it run overnight? Finally, FILENAME is where you want all this stuff stored.

Under other conditions, you would probably set up something to read these from the command line, get them from an initialization file, or query the user for them. But since I am going to be the only one using the program, and only using it once, I just hard-code the constants.

print "starting..."
d = Digg(APPKEY)
thesample = []

We start out with some basics. As with all the “prints” in the script, this one just lets us know what’s going on. The d = statement creates an instance of the Digg object, something that happens in any script that uses the PyDigg library, so that you can access the API. We also create an empty list that will hold our sample.

while len(thesample) < SAMPLESIZE:	
  cursor = int(random()*TOTALUSERS)	
  print "Getting data from offset ", cursor, '... ',
  users = d.getUsers(count=1, offset=cursor)

Here we begin the loop that makes up the program as a whole. As long as the length of our list (thesample)–which you will recall starts out at length zero–is less than the number we want (2000), we keep executing the instructions below. First we assign a random number from 0 to the total number of users to the variable “cursor”. Then, we use the “getUsers” method to pull one user record from that offset number. In practice, it would probably be good to have some error checking and exception handling here, but again–if we are only ever going to run this once or twice, that is probably overkill.

  if users[0] not in thesample:
    print "writing...",
    thesample.append(users[0])
    f = open(FILENAME,'a')
    f.write(users[0].name)
    f.write(',')
    if users[0].registered:
      f.write(users[0].registered)
    else:
      f.write('0') 
    f.write('\n')				# and go to the next line.
    f.close()	

OK, this chunk writes out the record we have snatched, assuming we don’t already have it (we don’t want repeats in the sample). You might reasonably ask why we don’t just write it all out at the end, since we’re keeping a list of users in “thesample.” We could do that, but if for some reason the power goes out overnight, or you hit an error, or your cat hits CTRL-C, you have lost all that data. So, since we are taking our time anyway, why not do complete writes each time?

If this user object isn’t in our list (thesample) we add it in. We then open up our file to append a new line. The getUsers method returns a list of users; in our case, a list that includes only one object, but a list nonetheless. So we are concerned with the “zero-th” item on that list, and the “name” portion of this, which we write out to the file. We follow that with a comma (all the better to read in later, and I am assuming that commas are not allowed in usernames), and if the object has it, the date of registration. Then we close the file up, to make sure that nothing gets lost in a buffer if there is an abnormal termination.

  print "pausing...",
  sleep(PAUSE)
  print len(thesample), "finished."

We take a short breather, and print out the length of our sample thus far, before going back up and starting the loop again, assuming that we haven’t yet met our goal.

print "done."

Just in case it isn’t obvious.

In practice, again, things were a bit messier, and you may find the above doesn’t work out exactly for you. For example, you probably want to set your default character set to unicode, and do a little error checking. But in spirit, the above shows how easy it is to get stuff.

I won’t go into the process of pulling the comments into a comma-separated file, since it is largely identical to the above process. If you want to do this, take a look at the objects in the Digg library, and see what sorts of things they return.

From here

Having collected a lot of data, we need to figure out how to work it into something usable. Next (in part 4), I will talk a little bit of ways of dealing with all of this data and munging it into formats that are useful.

[the making of, pt. 2] Assembling the lit review
https://alex.halavais.net/the-making-of-pt-2-assembling-the-lit-review/
Sat, 20 Sep 2008 02:47:42 +0000

(This is the second in a series of posts on the making of a research article. If you haven’t already, go back and check it out from the start.)

Overlapping sheets, not a point

A common response I get from undergrads undertaking their first literature review is that they cannot find anything. This usually points to a problem in how they are searching, since the more common issue is having too much to cover in a lit review, not too little. Generally, the complaint comes in the form of “I can’t find anything on advances in the pizza delivery business in Peru during the 1990s.” If you could find something exactly on point for your work, you might have a problem: after all, what are you adding to the conversation?

While you should definitely look for “near misses” for your own narrow research topic, generally you are going to have to stretch a little broader: any papers or books on pizza delivery are probably worth checking out, whether or not they are about Peru, related to business, or set in the 1990s. Likewise, you are probably interested in the food delivery business in Peru more broadly as well. In other words, by assembling a quilt made up of existing slips of material, you can shape your own literature. Your work should be the point that binds together otherwise relatively disconnected pieces of work. Some of these pieces may not fit together so neatly; others will be in much better shape.

Keeping track

Generally, I keep a few documents going, at least when I am organized enough to do so. The main document keeps track of things I’ve found. This includes a full citation, and either relevant quotes (clearly indicated with quotation marks so that I don’t make mistakes later) or summaries of the material. (Many people keep these on notecards, or the digital equivalent, but I have always just kept it all in a document.) A second document records the citations and other information for things I think I should take a look at. A third document includes a list of key search terms and authors that might help in searching for materials.

I generally start with the last list, looking for a set of keywords, works, authors, or phrases that I can use to locate more information. I get these from my own assumptions, or sometimes from places like Wikipedia and other general sources of information. I then use these keywords in a series of places. I generally begin on Google Scholar, these days. I might also make use of ComAbstracts and similar resources. From these, I end up with a set of citations of things I should check out. For articles, I can generally check them from home. These days, I am also likely to check Google Books to see whether I really need a particular book or not before making the trek to the library.

As I go through these, I add to the search bibliography and search terms documents. Often, an article will contain little of interest other than its reference to other literature. It is a very iterative process. Eventually, I feel like I am seeing the same citations and names consistently, and have a handle on that particular segment of the literature.

Finally, for larger projects I try to assemble the bibliographic resources in a citation manager. I used Endnote for many years, with varying degrees of success. I have now switched over to Zotero, and am finding it useful. Zotero is a free plug-in for Firefox that serves as a bibliographic manager, and provides a nice way to organize your resources and notes on the things you are citing. Whatever your approach, make sure you have complete citations organized in some electronic format. Cleaning up citations and making certain the bibliography is correct consistently takes me three or four times as long as I expect it to.

As a practical matter, the above (like the cake) is a lie. I have generally internalized those three documents, and write the lit review as I am collecting information and sources. (Now, if only I could–like Jeremy Hilary Boob–write a positive review of my article as I was composing it!) If, however, you don’t have a lot of experience writing this sort of stuff, a bit of structure can’t hurt, and the “three document” approach has worked for students in the past.

Useful Areas

So what sort of things am I looking for? Obviously, I should at least look at anything that touches on Digg and the culture surrounding it and other collaborative filtering sites. There is surprisingly little literature, I suspect, that deals with the current crop of collaborative filters that involve interaction (Digg, Reddit, StumbleUpon, flock, and several dozen others). Much of what is out there will be technical in nature, and less likely to go into much depth in terms of users or social issues.

I also want to lay a foundation in operant conditioning, particularly as it applies to user interfaces and behaviors within a discourse community. It’s been a long time since I encountered this stuff in intro psych, and I’m not entirely sure where I’m going to find material here, or whether I will. Since creating motivation to participate is particularly important in education, I suspect that literature may include work that looks specifically at reward systems and how they affect behavior.

Clearly, some of this has to do with the question of how people join groups and become acculturated to them. Although it is rare that such processes have such an obvious marker of success, it may be that there are models that can be applied to the processes observed on Digg.

There may be some applicable work on reputation systems. In particular, systems like the one found on eBay (where you can be rated up or down by those with whom you have interactions) might be relevant to this work as well.

Finally, I’m hoping I can mine some of my own material. In fact, I’ll probably start there. I presented a paper about Digg and elections at the National Communication Association meeting last year; there is a paper I worked on with Alex Tan that analyzed the structure of eBay ratings; and, as I mentioned, I may be able to draw on a chapter from my dissertation.

Reputation Systems

Since I’m treating this a bit like a cooking show, I won’t go through the entire process, but I’ll walk through one of those pieces that will be woven into a literature review: the literature on reputation systems. As I noted, and to the consternation of traditionalists and my own surprise, Wikipedia has become a useful part of my research process, so I head over there for a survey.

Although the entry itself doesn’t go much beyond a definition, it does link to a few articles, including one I vaguely remember reading in CACM, and more importantly to a treasure cache: this site, which includes a bibliography of relevant research papers. Yay! Always nice when someone does a lot of the work for you. I go about assembling these, looking for information in them that informs my own work, and looking through their bibliographies to find common ancestors and theoretical foundations that are shared.

I also assemble a set of key phrases that seem to be coming up and run them through Google Scholar. Google Scholar is nice because it provides the ability to easily search for citing documents, and move forward to the most recent literature. In all, reading and writing about reputation systems takes only an hour or two, and yields about a paragraph of my literature review.

As with writing generally, if you are stuck, write something. Extract some reference, any reference, and start tunneling into it. Generally being stuck means you are not moving, and the easiest way to start moving is simply to start moving–in any direction at all.

And Everything Else

I generally find that I need to revisit the literature review after the research itself is done, but after working through the areas above and finding out how they connect, I’ve managed to set the foundation of my research, show that it hasn’t already been done (a common issue), and show that the questions I am asking are interesting and important and will help build on the existing literature. I ended up organizing the brief review under three headings:

* Filtering People and Content, where I talk about reputation systems and collaborative filtering
* Digg, where I talk more about how the site works
* Responses to Evaluation, where I look a bit at forms of learning and the process of becoming a member of a community.

From here, I have to figure out how it is I am going to do what I am going to do. Generally, this means planning out a method in some detail. The reasons for this are varied, but particularly in research that is collaborative, you will find that you need to propose the research before you can move forward. This may be a proposal to your committee, if you are a graduate student, or to a funding agency. If you are making use of human subjects, you will have to describe to a human subjects board what you plan to do in order to gain their approval.

Here, I know what I need (the public data from a sample of Diggers, organized in such a way that I can make sense of it), and so it is a matter of muddling through. In the next installment, I deal with getting an initial sample.

[Continue on to part 3…]

[the making of, pt. 1] Of sausages and research papers
https://alex.halavais.net/the-making-of-pt-1-of-sausages-and-research-papers/
Wed, 17 Sep 2008 13:10:37 +0000

I’ve been meaning to do this for a while. When I get questions from students and colleagues, they are rarely about the things I’ve found, but about how I found them. How did you get that data and make sense of it? I do a poor job of explaining this, and many are unhappy when there isn’t a simple software tool that can accomplish what they want.

One of the advantages of blogging is the potential to live other people’s lives, and, as a result, learn a bit more about what they do. For an academic, blogs can (and often are) used to work through ideas, but more rarely is the process of doing research and writing threshed out. I’ve decided to do that for a paper I am writing, providing you with a blow-by-blow feel for how I do research, and how I write.

What I describe in the following posts isn’t the only way to do things, and it’s not even the normal way I do things, if such a thing exists. The process will change depending on the object of study, the type of work, and the team (or individual, as in this case) working on it. But hopefully it will provide you with some idea of how I go from nothing to a (hopefully interesting) research article.

There are dangers here. The first and foremost is looking like a complete idiot. But I’m kind of used to that, so I’m not going to worry too much about it. Hopefully, I’ll learn something out of the process. Another is that it messes with blinded peer review, but I’ve found that is already pretty messed up. If you have questions or comments, I’m very happy to hear them!

Ideation

The first step in any piece of research is thinking. I know that sounds obvious, and I only wish that it were. The idea is to come up with a piece of “specified ignorance,” as Merton put it: something we should know that we do not know. (Note that this is different from “I can use method X and theory Y; now I’ll apply it to yet another set of easily found data.”) It should also be something that you find exciting. Others will tell you that there are other requirements: that it be part of a sustained research agenda, that it be fundable, that it be of current interest to the profession, etc. I won’t disagree with any of that, but I don’t personally care very much about those things. I’m generally taken by fairly disparate kinds of issues and just let my curiosity get the best of me. Unsurprisingly (at least to me), these end up clustering together into some form of research agenda on their own.

Usually, I am inspired by “that’s cool” moments. A year or more ago, I found a lot of my time taken up by the Digg site. As it happens, I think that it and similar social filtering sites are really important to how the web works today, and we need to understand them better in general. I guess you could say they are becoming a stream in my research. But at the time, I just thought that a tool from Neaveru that let you view your posting history was pretty cool, and that it suggested some patterns in people’s behavior. Despite a self-image that claims not to care what other people think, I got pleasure from being widely “dugg,” and was frustrated when my comments were dugg down.

This got me thinking about the new explicit kinds of ratings of people that seem so common on the social web, from Technorati to Compare People to numbers of Twitter followers. All of this starts to feel a bit like distributed whuffie.

Of course, I’ve been thinking about related stuff for a while. And this new interest echoes back strongly to a bit of a side-track I engaged in for one of the chapters of my dissertation (pdf). I never did the “make your dissertation a book” thing–I was sick of it at the time, and now I wish I had done more with it after graduating. Anyway, I can at least fall back on some of the ideas I engaged with there, including an analysis of how experience on Slashdot led people to learn to get more up-votes from moderators.

Provisional Research Questions

So, I know I generally want to look at the function of the ratings of comments on Digg, and how they might be related to posting behavior. Although I’ve never been a fan of applying the strict hypothesis model to the social sciences, I do think you need to have some clear ideas of how you want to measure and operationalize your ideas. First, I need to narrow things a bit. Note that I expect that some of these ideas may not make it to the finished product (i.e., they may end up being stupid questions), and I may run across other things in the process that make me revisit my initial ideas.

I suspect, based on the work I did with Slashdot, that people learn to get better “grades” the longer they post on Digg. I’m not sure what “longer” really means, whether it is number of posts or amount of time, but it’s possible to check both. In the case of Slashdot, there appeared to be a “learning period” during which there was an increase in average post ranking. After this learning period, some people continued to post popular comments, while others abandoned popularity for obscurity, or actively posted trolling comments that would get ranked down.

The response of fellow Diggers to a comment also calls to mind something approaching a Skinner box, with small treat rewards for conforming behavior. I wonder if there is a direct measurable effect of positive or negative reinforcement of posting behaviors.

And, in a “mushier” sense, I’m wondering if there are traits that make it more or less likely for a comment on Digg to be voted up or down. I suspect that if I tried posting in French or Japanese, or just posted total nonsense unrelated to the discussion, this would get me voted down. So, being on-topic and writing in a language and style that can be comprehended seem important to a good Digg comment, but can we find particular kinds of content that tend to improve your chances? On Slashdot, for example, humorous remarks were key to high rankings.

So, I’m interested in three things (in a slightly different order than they are addressed above): Is a highly ranked post more likely to encourage a new post sooner? (And perhaps: is dropping out of posting related to consistently being Dugg down?) Do people tend to learn how to post more successfully over time? What content seems to result in the highest ranking?

From here

The truth is that my research and writing process is often pretty muddled, and I cannot promise that this will be any less the case this time. That said, in the following parts, I’ll address the process of assembling a brief literature review, writing up the method, collecting the data, analyzing the data, writing up the research, and going through the conference presentation and publication process.

Next: Part 2: Assembling a literature review
