XML Parsing

I’ve said it many times: I am pretty techy for a social scientist, but that ain’t saying much. So, I needed to find a way to harvest the lists of recent updates from weblogs.com and store them in a flat file for sampling and for use in a variety of other projects. Originally, I was just going to archive the HTML file every few hours, so as not to lose the information (it only lists a few hours’ worth of updates). But that seemed silly, given that they provide an XML feed and encourage its use.

I have never really dealt with XML parsing, so this looked like it might be more trouble than it was worth. The problem was not a lack of information on the subject, but information overload. I considered going through the O’Reilly book on Python and XML, but that seemed like serious overkill for such a tiny project. Thank goodness I found Dive Into Python, which spends a short chapter on the minidom parser. The parsing turned out to be as simple as I suspected, and that short walkthrough made it clear.
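For anyone wondering what the minidom approach looks like, here is a minimal sketch. It is an illustration, not the script mentioned in this post. The element and attribute names (`weblog`, `url`, `when`) are assumptions about how the feed is laid out, the sample XML is made up, and `parse_changes` is a hypothetical helper name.

```python
from xml.dom import minidom

# Made-up sample data. The layout (a <weblog> element per update, with
# "url" and "when" attributes) is an assumption about the feed's format.
SAMPLE = """<weblogUpdates version="1" count="2">
  <weblog name="Example Blog" url="http://example.com/" when="120"/>
  <weblog name="Another Blog" url="http://example.org/blog" when="45"/>
</weblogUpdates>"""


def parse_changes(xml_text):
    """Return a list of (url, when) pairs from a changes.xml-style string."""
    doc = minidom.parseString(xml_text)
    entries = []
    # getElementsByTagName finds matching elements anywhere in the document.
    for node in doc.getElementsByTagName("weblog"):
        url = node.getAttribute("url")
        when = int(node.getAttribute("when"))
        entries.append((url, when))
    return entries


for url, when in parse_changes(SAMPLE):
    print(url, when)
```

The whole thing comes down to two calls, `parseString` and `getElementsByTagName`, plus `getAttribute` on each node, which is probably why a short chapter is enough to cover it.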

Anyway, it’s a dumb little script, but I’ll put it up in case someone wants it. All it does is poll “changes.xml” every 2.5 hours and update a tab-delimited list of URLs with the most recent update time for each, adding any new URLs it encounters. I’ll test it overnight, then post it in the morning. The file it generates will be fed to a multithreaded parser some time soon, which will go and suck up the new text for analysis.
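In the meantime, the update step described above could be sketched roughly as follows. This is a guess at the general shape, not the actual script. All the function names are hypothetical, the timestamps are assumed to be ISO 8601 strings in UTC (so that plain string comparison orders them correctly), and fetching the feed is left to a caller-supplied `fetch_entries` function.

```python
import csv
import os
import time

POLL_SECONDS = 2.5 * 60 * 60  # 2.5 hours between polls


def load_table(path):
    """Read url<TAB>last_update rows into a dict; a missing file is empty."""
    table = {}
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh, delimiter="\t"):
                if len(row) == 2:
                    table[row[0]] = row[1]
    return table


def save_table(path, table):
    """Write the table back out as tab-delimited rows, sorted by URL."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for url in sorted(table):
            writer.writerow([url, table[url]])


def merge_updates(path, entries):
    """Merge (url, timestamp) pairs into the file, keeping the latest time.

    Assumes timestamps are same-format ISO 8601 UTC strings, which sort
    chronologically when compared as plain strings.
    """
    table = load_table(path)
    for url, stamp in entries:
        if url not in table or stamp > table[url]:
            table[url] = stamp
    save_table(path, table)
    return table


def poll_forever(fetch_entries, path):
    """Call fetch_entries() every POLL_SECONDS and merge the results."""
    while True:
        merge_updates(path, fetch_entries())
        time.sleep(POLL_SECONDS)
```

Rewriting the whole file on every pass is wasteful for a very large list, but at one pass every 2.5 hours it keeps the logic easy to follow.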

One Comment

  1. Posted 3/3/2003 at 9:52 am | Permalink

    Hooray!!
