Obtaining a sample of blogs is, as with sampling on the web in general, not as easy as it should be. One way I’ve done this in the past is to sample from a “ping server.” Most weblog systems and services send pings to update services to let them know that new content has been posted. One of these servers is located at weblogs.com. Moving forward, I hope to use blo.gs, now that it has been bought by Yahoo and is back in business. They’ve recently OKed my access, but I need to change my system a bit to draw pings from that server.
The last time I pulled from weblogs.com, it worked pretty well. Basically, I am working with a couple of other people on a content analysis of a relatively representative sample of weblogs. So, if I gather all the pings weblogs.com receives over a week and pull random blogs from that pool, I should have a decent sample. I have a few restrictions: the blog must be apparently single-author, written primarily in English, and not primarily commercial in purpose. This necessitates sifting through the sample by hand, and I’m doing this 100 blogs at a time.
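For anyone who wants to try the same thing, the drawing step might look roughly like the sketch below. It is not the script I actually run. The function names, the crude URL normalization, and the choice to count each blog once no matter how often it pinged are all assumptions made for illustration.

```python
import random


def normalize(url):
    """Crude normalization (an assumption): trim, lowercase, drop trailing slash."""
    return url.strip().lower().rstrip("/")


def sample_blogs(ping_urls, k=100, seed=None):
    """Draw up to k distinct blogs uniformly at random from a list of ping URLs."""
    # A blog that posts often also pings often, so collapse repeat pings
    # to give every blog the same chance of being drawn.
    unique = sorted({normalize(u) for u in ping_urls})
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(unique, min(k, len(unique)))
```

Calling `sample_blogs(week_of_pings, k=100)` would return one batch of 100 to sift by hand. Skipping the deduplication step would instead weight the sample toward frequent posters, which may or may not be what you want.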
I was prepared for the number of splogs (spam blogs). It doesn’t take much to notice that spam has infested the blogosphere. I was less prepared for the number of Asian-language blogs, particularly Chinese.
I kept an informal count of the last 100 I looked at. Of these, only 23 met the requirements I laid out above. 44 were splogs, 20 were written primarily in Chinese, and 12 were in other languages (four in Japanese, a couple in Thai, a couple in Farsi, one Russian, one Portuguese, and two in Spanish). I think it would be a mistake to extrapolate from this and suggest that 20% of the blogosphere is Chinese, but from an informal, personal perspective, the increase in Chinese-language bloggers is striking.
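To give a rough sense of how loose a figure like "20 out of 100" is, here is a small sketch that computes a Wilson score interval for a proportion. The function name and the choice of z = 1.96 (roughly 95% confidence) are my assumptions. It also captures only random sampling error, not any bias in which blogs happen to ping weblogs.com, which is probably the bigger problem.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 20 Chinese-language blogs out of the 100 examined
low, high = wilson_interval(20, 100)
print(f"{low:.1%} to {high:.1%}")
```

Even under the generous assumption that the pings are a fair draw from the blogosphere, a single batch of 100 leaves a wide range of plausible values.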