I’ve said it many times: I am pretty techy for a social scientist, but that ain’t saying much. So, I needed to find a way to harvest the lists of recent updates from weblogs.com and store them in a flat file for sampling and use in a variety of other projects. Originally, I was just going to archive the HTML file every few hours so as not to lose the information (it only lists a few hours’ worth of updates). But that seemed silly given that they provide, and encourage use of, an XML feed.
I have never really dealt with XML parsing, so this looked like it might be more trouble than it was worth. The problem was not a lack of information on the subject, but information overload. I considered going through the O’Reilly book on Python and XML, but that seemed like serious overkill for such a tiny project. Thank goodness I found Dive into Python, which spends a short chapter on the minidom parser. Parsing turned out to be as simple as I suspected, and that short walkthrough made it all clear.
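For the curious, here is a minimal sketch of the kind of minidom parsing involved. The XML snippet is a made-up stand-in for the weblogs.com changes.xml format (which lists `<weblog>` elements with `url` and `when` attributes); the function name is just for illustration, not the script’s actual code.

```python
# Sketch: pull (url, when) pairs out of a changes.xml-style document
# with the standard-library minidom parser.
from xml.dom import minidom

# Hypothetical snippet in the changes.xml style; the real feed lists
# thousands of <weblog> elements.
SAMPLE = """<?xml version="1.0"?>
<weblogUpdates version="1" updated="Mon, 01 Mar 2004 10:00:00 GMT">
  <weblog name="Example Blog" url="http://example.com/" when="120"/>
  <weblog name="Another Blog" url="http://another.example.org/" when="3540"/>
</weblogUpdates>"""

def extract_updates(xml_text):
    """Return a list of (url, when) attribute pairs from the document."""
    dom = minidom.parseString(xml_text)
    return [(node.getAttribute("url"), node.getAttribute("when"))
            for node in dom.getElementsByTagName("weblog")]

print(extract_updates(SAMPLE))
```

That really is all there is to it: `parseString`, `getElementsByTagName`, and `getAttribute` cover this whole use case.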
Anyway, it’s a dumb little script, but I’ll put it up in case someone wants it. All it does is poll “changes.xml” every 2.5 hours and update a tab-delimited list of URLs (adding to it if new ones are encountered) with the most recent update time. I’ll test it overnight, then post it in the morning. The file it generates will be fed to a multithreaded parser some time soon, which will go and suck up the new text for analysis.