…and know when to run. I reported recently on the failure of a fairly time-consuming research project, one I had hoped to publish in a special issue of JCMC. I had hypothesized that if collaborative filtering systems led to a narrowing of topics or perspectives, we might be able to measure how and why. Slashdot was the prototypical test bed. I parsed about 67,000 comments for analysis. One possibility was that when people posted alongside groups they had posted with before, their scores would be higher. I was fairly certain, going in, that this would be the case. Unfortunately, there seemed to be no correlation at all between co-posting and scores. This was frustrating, to say the least.
So, I thought maybe my approach was wrong-headed. Maybe, in fact, collaborative filtering actually favored the ideas and postings which were unique, or at least unusual. So, I turned to the tool I’ve been using on a bunch of projects: word-frequency differences. The new hypothesis was that the more different a comment was semantically (or at least in terms of word frequency; there are still some validity issues here), the higher its score would be.
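For readers wondering what a word-frequency difference might look like in practice, here is a minimal sketch. It is an assumption about one reasonable way to do it, not the actual analysis code: the function names are made up for illustration, and the distance used (half the summed absolute difference between a comment's relative word frequencies and the whole corpus's) is just one of several plausible choices.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens):
    """Map each word to its share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def originality(comment, corpus_freqs):
    """Hypothetical originality score for one comment.

    Returns half the sum of absolute differences between the comment's
    word frequencies and the corpus's: 0.0 means the same distribution,
    1.0 means no words in common. Empty comments are scored 0.0.
    """
    comment_freqs = relative_freqs(tokenize(comment))
    if not comment_freqs:
        return 0.0
    vocab = set(comment_freqs) | set(corpus_freqs)
    return 0.5 * sum(
        abs(comment_freqs.get(w, 0.0) - corpus_freqs.get(w, 0.0)) for w in vocab
    )


# Toy usage: build corpus frequencies from all comments, then score each one.
comments = ["linux is great", "linux is great and free", "my cat ate a burrito"]
corpus_freqs = relative_freqs([t for c in comments for t in tokenize(c)])
scores = [originality(c, corpus_freqs) for c in comments]
```

In the toy example the off-topic third comment scores as more "original" than the first. The validity worry is visible even here: the measure only sees vocabulary, so a comment using unusual words to say something ordinary still looks original.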
I spent at least another man-week re-parsing and analyzing the data, and this too produced un-stunning results. In fact, there is a negative correlation between score and originality, of about -.16. That is to say, the more unusual your posting, the less likely it is to receive a high score. Unfortunately, the correlation is really too weak to be of much import. I suppose it is an interesting finding, but not interesting enough to be really publishable.
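To put a number on "too weak": squaring a correlation gives the share of variance in one variable that is linearly accounted for by the other, so -.16 works out to roughly 2.6 percent. The sketch below shows that arithmetic alongside a plain Pearson correlation. It assumes the correlation reported above was a Pearson r, and the function names are illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


def variance_explained(r):
    """Share of variance linearly accounted for by a correlation r."""
    return r * r


# The correlation reported in the post, squared.
share = variance_explained(-0.16)
print(f"r = -0.16 explains about {share:.1%} of the variance in scores")
```

With tens of thousands of comments, a correlation this size can easily be statistically significant while still telling you almost nothing about any individual comment's score.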
So, that is really a bummer. Not only did I let down Han Park, one of the co-editors for the issue, and pass up an opportunity to get some of my (non)work out into the public eye, I also wasted valuable time that could have been spent on a number of other projects.
I guess the only up-side is that I can now move on to other projects. I will be working this week on the hate speech project I had begun with Kim-Alla last year. She is too busy with law school to invest much time in it, I think, so I am going to move forward with some of the quantitative analysis, and once I have the results there worked out, I’ll bring her back in for more of the qualitative analysis. She has already done a great job in helping with the conceptualization and drafting a good segment of the methods.
Everyone I have talked to about this has said “at least you have the data.” I think this is a poor consolation prize. It’s too large to put up on the web, but I will burn it off to a couple of CDs. I can’t imagine that it is very useful, but I suppose if anyone has a viable use for it, they can contact me. It’s basically a parsing out of each comment (not the topic heads) for the first 15 days of September. I have the raw data for the first two weeks of December as well, and this could be parsed if there is some interest.
I suppose one area of interest is that I do have the co-posting network for this data set. That is, there were about 67,000 comments, and these fell in a little over 400 topics, so I have the network of people who posted, with tie strength measured as the number of topics in which they shared a post. I also have lists of people for each topic and topics for each person. Like I said, if someone wants this, let me know and I can burn off a copy for you. But like the title says, I’m burying this thing.
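For anyone who does want to work with data like this, the co-posting network described above can be rebuilt from a flat list of (topic, poster) pairs. The following is a minimal sketch under that assumption. The input format, names, and function are hypothetical and do not describe the actual files.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_copost_network(posts):
    """Build a co-posting network from (topic_id, user) pairs.

    Returns three things:
      ties           - Counter mapping (user_a, user_b) to the number of
                       topics in which both posted (the tie strength)
      users_by_topic - dict of topic_id -> set of users who posted there
      topics_by_user - dict of user -> set of topic_ids they posted in
    """
    users_by_topic = defaultdict(set)
    topics_by_user = defaultdict(set)
    for topic, user in posts:
        users_by_topic[topic].add(user)
        topics_by_user[user].add(topic)

    ties = Counter()
    for users in users_by_topic.values():
        # Sorting gives each pair one canonical (a, b) key with a < b.
        for a, b in combinations(sorted(users), 2):
            ties[(a, b)] += 1
    return ties, users_by_topic, topics_by_user


# Toy usage with made-up posters; repeat posts in a topic count once.
posts = [
    ("t1", "ann"), ("t1", "bob"), ("t1", "ann"),
    ("t2", "ann"), ("t2", "bob"), ("t2", "cy"),
]
ties, users_by_topic, topics_by_user = build_copost_network(posts)
```

Note that the number of pairs grows with the square of the number of posters in a topic, so a few very busy topics can dominate both the run time and the resulting network.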