Know when to walk away…

…and know when to run. I reported recently on the failure of a fairly time-consuming research project, a project I had hoped to publish in a special issue of JCMC. I had hypothesized that if collaborative filtering systems led to a narrowing of topics or perspectives, we might be able to measure how and why. Slashdot was the prototypical test bed. I parsed about 67,000 comments for analysis. One possibility was that when people posted alongside groups they had posted with before, their scores would be higher. I was fairly certain, going in, that this would be the case. Unfortunately, there seemed to be no correlation at all between co-posting and scores. This was frustrating, to say the least.

So, I thought maybe my approach was wrong-headed. Maybe, in fact, collaborative filtering actually favored the ideas and postings which were unique, or at least unusual. So, I turned to the tool I’ve been using on a bunch of projects, the word frequency differences. The hypothesis was that the more different a comment was, semantically (or at least in terms of word frequency; there are still some validity issues here), the higher its score would be.
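The post does not say how the word-frequency differences were computed, so the following is only a minimal sketch of one plausible measure, not the procedure used for this project. It assumes each comment is compared with the pooled word counts of the whole corpus using cosine distance; the names `tokenize` and `distinctiveness` and the three sample comments are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distinctiveness(comment, corpus_counts):
    """Cosine distance between a comment's word counts and the corpus counts.

    0.0 means the comment uses words in the same proportions as the corpus;
    values near 1.0 mean it shares almost no vocabulary with the corpus.
    This is an assumed stand-in for "word frequency difference".
    """
    counts = Counter(tokenize(comment))
    if not counts or not corpus_counts:
        return 0.0
    dot = sum(n * corpus_counts.get(word, 0) for word, n in counts.items())
    norm_comment = math.sqrt(sum(n * n for n in counts.values()))
    norm_corpus = math.sqrt(sum(n * n for n in corpus_counts.values()))
    return 1.0 - dot / (norm_comment * norm_corpus)


# Hypothetical comments, standing in for the parsed Slashdot data.
sample_comments = [
    "linux is great and linux is free",
    "linux is fast and free",
    "my cat enjoys purple marmalade",
]
corpus_counts = Counter(w for c in sample_comments for w in tokenize(c))
originality = [distinctiveness(c, corpus_counts) for c in sample_comments]
```

A measure like this only captures vocabulary, not meaning, which is the validity issue mentioned above: a comment can say something quite ordinary in unusual words and still look "original".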

After I spent at least another man-week on re-parsing and analyzing the data, this too produced un-stunning results. In fact, there is a negative correlation between score and originality, of about -.16. That is to say, the more unusual your posting, the less likely it is to receive a high score. Unfortunately, the correlation is really too weak to be of much import. I suppose it is an interesting finding, but not interesting enough that it is really publishable.
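To put a number on "too weak", it helps to square the coefficient. The sketch below assumes the reported figure is an ordinary Pearson correlation (the post does not say which coefficient was used); `pearson_r` is a hand-written illustrative helper, not the tool used in the analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# The coefficient reported above, and the share of variance in score
# that a linear relationship of that strength would account for.
reported_r = -0.16
shared_variance = reported_r ** 2  # r squared
```

With r at about -.16, r squared is about .026, so under this assumption originality accounts for less than 3 percent of the variation in scores.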

So, that is really a bummer. Not only did I let down Han Park, one of the co-editors for the issue, and pass up an opportunity to get some of my (non)work out into the public eye, I wasted valuable time that could have been spent on a number of other projects.

I guess the only up-side is that I can now move on to other projects. I will be working this week on the hate speech project I had begun with Kim-Alla last year. She is too busy with law school to invest much time in it, I think, so I am going to move forward with some of the quantitative analysis, and once I have the results there worked out, I’ll bring her back in for more of the qualitative analysis. She has already done a great job in helping with the conceptualization and drafting a good segment of the methods.

Everyone I have talked to about this has said “at least you have the data.” I think this is a poor consolation prize. It’s too large to put up on the web, but I will burn it off to a couple of CDs. I can’t imagine that it is very useful, but I suppose if anyone has a viable use for it, they can contact me. It’s basically a parsing out of each comment (not the topic heads) for the first 15 days of September. I have the raw data for the first two weeks of December as well, and this could be parsed if there is some interest.

I suppose one area of interest is that I do have the co-posting network for this data set. That is, there were about 67,000 comments, and these fell in a little over 400 topics, so I have the network of people who posted, with tie strength measured as the number of topics in which they shared a post. I also have lists of people for each topic and topics for each person. Like I said, if someone wants this, let me know and I can burn off a copy for you. But like the title says, I’m burying this thing.
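For anyone curious what that network looks like as data, here is a minimal sketch of building it from a list of posters per topic. The input layout, the function names, and the toy topics are assumptions for illustration; the actual files may be organized differently.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coposting_ties(posters_by_topic):
    """Count, for every pair of posters, the topics in which both posted.

    posters_by_topic maps a topic id to the posters who commented in it.
    Repeat comments by one person within a topic count once.
    """
    ties = Counter()
    for posters in posters_by_topic.values():
        for a, b in combinations(sorted(set(posters)), 2):
            ties[(a, b)] += 1
    return ties


def topics_by_poster(posters_by_topic):
    """Invert the mapping: the set of topics each poster commented in."""
    topics = defaultdict(set)
    for topic, posters in posters_by_topic.items():
        for poster in posters:
            topics[poster].add(topic)
    return dict(topics)


# Hypothetical toy data, not taken from the real data set.
toy_topics = {
    "topic-1": ["ann", "bob", "cy"],
    "topic-2": ["bob", "ann", "ann"],
    "topic-3": ["cy"],
}
toy_ties = coposting_ties(toy_topics)
toy_lookup = topics_by_poster(toy_topics)
```

Each key in the result is an alphabetically ordered pair of posters, and the value is the tie strength, so the output can be written straight out as a weighted edge list.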



  1. Posted 2/1/2003 at 10:22 pm

    Everyone knows that all you have to do to get a higher score on Slashdot is to either say something good about Linux or something bad about Microsoft =)

  2. alex
    Posted 2/2/2003 at 4:44 pm

    This bit of conjecture actually kept popping up in the back of my head while I was doing this. In fact, the data is there and ready to be analyzed. I doubt finding the validity of this claim constitutes something important to the academic community, but it would be a fine way to get Slashdotted.

    The open question is how to define positive or negative. You could easily check to see if MS- or Tux-related words showed up in a post and whether that affected its score. You could also do some basic content analysis, using off-the-shelf products to decide whether the post was talking about each in a positive or negative light. But you quickly see how some structures (“nothing sucks as much as a Linux GUI that replicates the Windows paradigm”) or sarcasm (“No problem: just run regedit on mame and everything should be solved.”) wouldn’t really get coded right.

    Now what we could do is pull out the comments that use any of these keywords and then hand code them for positive or negative mentions. Are you volunteering? I didn’t think so :).

    At any rate, I have a feeling the result would be about as successful as the other two attempts. Even though this is the common perception, I suspect that mentions of either have no discernible effect on one’s score.

    The hard truth may be that score is really most affected by the quality and utility of the post, surprisingly enough. Unfortunately, there is no good way to judge quality and utility besides asking people, and that survey is already happening on a daily basis on Slashdot.

