Despite one of my New Year’s resolutions, I don’t seem to be able to get ahead of the curve. To wit:
I have promised an editor (hi, Han!) a paper examining the effects of collective filtering on Slashdot posts. I had all of the data together — the initial set is every posting from the first couple of weeks of September — but hadn’t anticipated the trouble of processing, for example, a matrix almost 13,000 on a side. I’m not really doing any network-related manipulations of this data, but just assembling and processing it (the file is in the 600MB range) has been a pain.
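For scale, a quick back-of-envelope calculation shows why a matrix that size hurts (assuming 4-byte integers, which is an assumption on my part — the exact element size depends on how the data is stored):

```python
# Rough memory footprint of a ~13,000 x 13,000 integer matrix,
# assuming 4 bytes per integer (32-bit ints).
n = 13000
bytes_per_int = 4
total_bytes = n * n * bytes_per_int
print(total_bytes / 1024 ** 2)  # about 645 MB -- right in that 600MB range
```

No wonder a machine with a couple hundred megabytes of RAM chokes on it.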
One of the reasons it has been a pain is that I bought 512MB of pulled memory on eBay that turned out to really be 256MB. I feel pretty dumb about it, since I had been running it for a couple of weeks before this work made it clear something was wrong. I figured one of the two DIMMs had gone out, but when I checked, it was clear that they were a couple of 128MB sticks that the seller had just slapped with Crucial 256MB stickers. Given that it isn’t always obvious — unless you watch the startup screen or ask the OS — how much memory you have, I wonder how many other times he’s gotten away with this.
So, I’m doing the runs at work and trying to shuttle back and forth to check them. Unfortunately, they have me on Windows 98 at work, and the numarray package (Python’s numeric arrays) only works with physical memory on 98 for some reason. So the computer dies if you ask for a matrix (of integers) larger than about 7,000 on a side. Instead, I have to use Python’s built-in lists, which take longer to process. I’m looking at about 10 hours per run at work, which would be fine if I could FTP the results home, but since I can’t access my desktop from home (why not?!), I have to run in, grab each run, and burn it to a CD to work on. I could just stay and work at work, but I would have to come home periodically to take care of the Finn anyway. I am hoping that when I get the data today, it will have processed correctly and I can move forward.
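The fallback looks something like this — a minimal sketch, not my actual processing code, and `build_counts` is a name I’m making up for illustration. The idea is just that a nested list of lists stands in for the packed numeric array when the allocation fails:

```python
# Fallback when a big numeric-array allocation fails on Win98:
# plain nested Python lists. Slower and bulkier per element than a
# packed array, but the interpreter allocates them without issue.
def build_counts(pairs, n):
    # n x n matrix of co-occurrence counts, as a list of lists
    m = [[0] * n for _ in range(n)]
    for i, j in pairs:
        m[i][j] += 1
    return m

counts = build_counts([(0, 1), (1, 0), (0, 1)], 3)
print(counts[0][1])  # 2
```

Each list cell holds a full Python integer object rather than a raw 4-byte value, which is a big part of why the list-based runs take so much longer.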
The research proposes two hypotheses. Slashdot, as you may know, allows anonymous moderators, chosen at random, to increase or decrease by one the score of postings on the site. These scores can run from -1 to 5. I suggest that there is a correlation between posting to topics in which your “friends” (those with whom you often post) are present and the score on a posting. This may seem obvious, but there are a number of factors that cannot really be measured. It is not usually the friends who are doing the scoring, for example. That is to say, the score should be indicative of what readers as a whole think of the value of a contribution. I hypothesize that those postings made in “familiar” settings tend to be held in higher esteem by the community at large — i.e., are of better subjective quality.
Assuming that this correlation is present (and I should know this by tomorrow morning at the latest), the second hypothesis is that this tends to lead to more postings among those who are your “posting buddies”: that is, birds of a feather flock together. This may seem like an obvious effect, but since there is undoubtedly churn in the system, affiliation numbers from September are likely to be lower by December, all else being equal. So, I am suggesting (hoping?) that the affiliation scores increase significantly.
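For the curious, the affiliation measure boils down to counting co-postings per pair of users across topics. Here is a minimal sketch of that idea — the function name and the shape of the input are my own inventions for illustration, not the paper’s actual method:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sketch: count how often each pair of users posts
# in the same topic. "topics" maps a topic id to the set of users
# who posted there.
def affiliation_counts(topics):
    pairs = Counter()
    for users in topics.values():
        # sorted() so each pair has one canonical ordering
        for a, b in combinations(sorted(users), 2):
            pairs[(a, b)] += 1
    return pairs

demo = {"story1": {"alice", "bob", "carol"}, "story2": {"alice", "bob"}}
print(affiliation_counts(demo)[("alice", "bob")])  # 2
```

Comparing these pair counts between a September snapshot and a December one is the kind of before/after the second hypothesis calls for.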
All of this is a bit wing-and-a-prayerish. It may be that there is no correlation between posting scores and clustering of posters. It may be that posters do not change their posting behavior to target narrower interests over time. Even if both of these hold, it is unclear that the causation is absolute: there are a large number of potential intervening variables. That is, however, the nature of social research. I feel more out on a limb on this one than on other projects, since if these hypotheses do not wash, I’m pretty much back to the drawing board. Meanwhile, my January 1 deadline has come and gone.