Tonight in seminar, John noted that he had a problem. He needed the URLs generated by Yahoo search. However, if you search for something on Yahoo, say “cheese,” mouse over the first result, “I love Cheese!”, and look down at the URL of the hyperlink in the message bar, you will find… well, that it is a bit of a mess. It isn’t the expected http://www.ilovecheese.com. The reason for this, you may have guessed, is that Yahoo uses a script to track outbound clicks. (If they are smart, they are using this to rank their results, so that the most oft-clicked results show up at the top.)
But John has a problem. He needs to collect the proper links from a *lot* of such pages, and doesn’t want to fuss with reading through the page source for each of them. So, instead, he saves the page, and runs it through a program that will extract the links for him. I put together just such a program, and below, I will explain how it works. If you want to run the program, you will need to download a free copy of the Python interpreter here and install it on your computer.
Python programs, like HTML, can be written in a regular plaintext editor like Notepad. The programs are normally saved as “something.py” instead of “something.txt”. Below, I’ll have the program bits in blockquotes, with explanations interspersed.
# A script for John that strips out the search URLs from
# a saved Yahoo search response page
# Alex Halavais, email@example.com
# 15 June 2005
# We need the "regular expression" module
import re
Any part of a program that starts with # is just ignored by the interpreter. It is meant for human-legible comments. In fact, code with enough comments, and sensible naming conventions, doesn’t need to be documented: it is self-documenting.
Anyway, the only real “code” part of this is the request to “import re.” When you have a programming task, you often need to import several libraries. Libraries are a bit like toolboxes: they contain the tools you might need on any particular programming job. In this case, I am going to be doing some pattern-matching of text, and so I want to use regular expressions, which are a kind of pattern-matching language. You can use regular expressions in many computer languages, and they are a bit tricky when you get started, but make more sense after a while. Just as you can use wildcards to search for things in certain systems (fish* gets you fishing, fisher, fishsticks, etc.), regular expressions allow you to finely tune different kinds of wildcards.
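To make the wildcard comparison concrete, here is a small sketch. It assumes a current Python 3 interpreter, and the sample sentence is made up for illustration. The pattern fish\w* plays roughly the role of fish*: the letters “fish” followed by zero or more letters, digits, or underscores.

```python
import re

# A made-up sentence to search; it is not from any real page.
sample = "The fisher went fishing and came home with fishsticks."

# \w* means "zero or more word characters", much like the * wildcard.
fish_words = re.findall(r"fish\w*", sample)
print(fish_words)
```

The result is a list holding each word that starts with “fish,” in the order it appears in the sentence.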
OK, so we have our toolbox, on to the next bit…
# We are going to set up a regular expression
# pattern that catches each of the links.
# This relies on noticing that the link anchors
# are assigned to the class "yschttl"
# (Yahoo search title?). The url itself is a
# long internal link with a lot of stuff in it
# we can ignore until we get to the "A//"
# part. That's the beginning of the URL we want.
# We collect everything after that until we hit a quotation mark.
URLS = re.compile('a class="yschttl" .*href=".*A//([^"]+)')
This is, again, mostly comments. That last line is scary enough that most people decide immediately after looking at it that they will never learn to program and should go and meditate in the woods. Don’t worry, it looks to everyone like that. Basically, it says I’m looking for a link statement in the HTML code that is associated with the class “yschttl” (Yahoo SearCH TiTLe? Yeti SCHool TurTLe?). Really what I am after is the piece between the parentheses, which reads [^"]+ and basically means “give me everything at this point until you hit a quotation mark.”
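If the bit in parentheses is the confusing part, it may help to see it on its own. The following is only a sketch, assuming Python 3; the HTML snippet and the names html, match, and captured are invented for the example and are not part of John's script.

```python
import re

# An invented snippet of HTML, just for illustration.
html = '<a href="http://www.example.com/cheese">Cheese</a>'

# [^"] means "any character except a quotation mark"; + means "one or more".
# The parentheses mark the part we want handed back to us.
match = re.search(r'href="([^"]+)', html)
captured = match.group(1)
print(captured)
```

Everything from just after href=" up to, but not including, the closing quotation mark ends up in captured.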
Still don’t get this part? No problem. You can always put off learning about regex (regular expressions) until you are 40 — no one will think less of you for it, and there are often other ways at getting at the same stuff.
# First, we need to ask what the input and output file names are
fileIn = raw_input('What is the name of the saved html file? ')
fileOut = raw_input('What would you like to call the output file? ')
In this part, we gather the names of two files, and store them in variables called “fileIn” and “fileOut.” I could have called them foo and bar, or anything else I liked. But since they will be storing the file names, it makes sense to label them properly.
When I use the = sign here, really I am saying “put the thing on the right into the variable on the left.” In fact, in some languages, a <- is used instead, and that makes some sense. Anyway, in each of these cases we are putting in a "raw_input." What is that? It is whatever the user types in. Further, we are telling the computer that before finding out what the user types in, it should print out a quick question. When the interpreter gets to this part of the program, it will print out the question and wait for the user to type in an answer. Then the answer will be stored in fileIn. It then does the same thing for fileOut.
# Now, read in all the text in the file indicated and store it in “inText”
inText = open(fileIn).read()
Once again, I am putting stuff into a variable, this time the variable I’ve named “inText.” What am I putting into inText? Well, I’ve saved some space by telling the interpreter to do a couple of things at once. I want it to open a file called… well, whatever the name is that is stored in fileIn. Then, using that open file (I guess that’s one way to interpret the “.” there, as “using this do that”), I want it to read in everything. So all the HTML in that file is shoved into inText. inText can hold as much text as your computer’s memory allows, so for a saved web page, don’t worry how much text is in the file.
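For comparison, here is a slightly longer way to do the same open-and-read step. This is a sketch rather than the script itself: it assumes Python 3, assumes the saved page is UTF-8 text, and uses a hypothetical helper name, read_whole_file. It also writes a throwaway file first so that it can run without a real saved page.

```python
import os
import tempfile

def read_whole_file(path):
    """Return the full contents of the file at `path` as one string."""
    # `with` closes the file for us once the reading is done.
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "saved_page.html")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("<html>cheese</html>")
    inText = read_whole_file(demo_path)

print(inText)
```

The one-line version in the script never closes the file explicitly; for a short program like this that rarely matters, but the “with” form is the tidier habit.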
# Make a list of all the
# things in the text that match the pattern above
# and store it in "theList"
theList = URLS.findall(inText)
Here we get to use one of the tools in our “re” (regular expression) library. The tool “findall” takes a pattern (which we defined up above as the URLS pattern) and compares it to some text (in this case, the text held by “inText”). Any matches it finds to that pattern, it puts in a list. That list of matches is stored, naturally, in “theList.”
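Here is a sketch of what findall hands back, assuming Python 3. The two lines of HTML are invented to have roughly the shape described above (real Yahoo markup will differ in its details), and the rds.example addresses are placeholders.

```python
import re

URLS = re.compile('a class="yschttl" .*href=".*A//([^"]+)')

# Two invented result lines, one link per line.
inText = (
    '<a class="yschttl" href="http://rds.example/x/*-http%3A//www.ilovecheese.com">I love Cheese!</a>\n'
    '<a class="yschttl" href="http://rds.example/y/*-http%3A//www.cheese.example/types">Cheese types</a>\n'
)

# Because the pattern has one pair of parentheses, findall returns
# only the captured piece of each match, not the whole match.
theList = URLS.findall(inText)
print(theList)
```

One thing to watch for: the “.*” pieces are greedy, and “.” does not cross line breaks. If a saved page happened to put two result links on the same line, this pattern would return only the last link on that line.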
# Open up an output file to write into
f = open(fileOut,'w')
OK, last time we opened a file, we did something with it right away, using the “.” operator. This time, we want to hold onto the open file for a while, so we are going to put it in “f” (you know, for “file”). The file we are opening will be named whatever is stored in “fileOut” and it will be for “w”riting to, rather than reading. (We could have used a ‘r’ up above, but when you don’t specify, Python assumes you want to read a file.) We will be using “f” to manipulate this file for a while, before we finally close it back up.
# for each of the items in the list
for eachItem in theList:
    # Write out the http:// part that we stripped out above
    f.write('http://')
Now we are getting a bit fancy. Computers are good at doing things over and over and over. We want it to go through each of the URLs we found earlier and do something with it. Luckily, that command looks a lot like English. We want it to consider each item separately, and do some things with it. All the stuff we want to do with it will be indented a bit.
The first thing we want it to do is write the text ‘http://’ out to the file. That “.” is showing up again. It says, “take the object ‘f’ and do the following with it,” and then tells it to write some text to the object “f” (the file).
    # write each item
    f.write(eachItem)
    # hit return (write a "newline" character)
    f.write('\n')
Now we want it to write another thing to the file, this time whatever item it is we are considering at the present. It will go through these lines for each item on the list, writing each item once to the file.
After that, we want it to write an “enter” or “return” key, also called a “newline,” and represented by the code “\n”. It needs a special code, because how else can you tell it to write an “enter”?!
So for each of the items on the list, it will do three things: write “http://” to the file, write one of the URLs it found to the file, write a newline to the file.
# Close the file.
f.close()
Note that we are no longer indenting. This thing we expect it to do only once. Take the file “f” and (.) close it.
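The write-and-close steps can be sketched as one small function. This assumes Python 3; write_urls is a hypothetical name, the list of items is made up, and the output goes to a throwaway folder so the sketch can run on its own.

```python
import os
import tempfile

def write_urls(items, path):
    """Write each item to `path`, one per line, with http:// put back in front."""
    # `with` closes the file when the block ends, so no f.close() is needed.
    with open(path, "w", encoding="utf-8") as f:
        for eachItem in items:
            f.write("http://")
            f.write(eachItem)
            f.write("\n")

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "urls.txt")
    write_urls(["www.ilovecheese.com", "www.example.com/cheese"], out_path)
    # Read the file back to see what was written.
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

The three writes could also be folded into a single f.write("http://" + eachItem + "\n"); keeping them separate just mirrors the three steps described above.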
That’s the whole program. When you run it, and enter the name of a saved HTML file from Yahoo, it strips out the URLs and writes them to a file you specify.
There are two ways to run this program. You have installed Python, right? Can’t do anything without that. Once it is installed, on Windows any file with the .py extension will appear with a smiley green snake icon. You can just double-click this and the program should run. Alternatively, from the command line (remember that?) you can type “python yahoome.py” and the program will be run (interpreted) by Python.
With a couple more lines of code, this can be extended to check all of the files in a particular directory for the pattern. Still more lines of code, and the program can go and get the pages directly from Yahoo and “scrape” out the URLs. And chances are, with a weekend or two of work, you could be writing programs like that.
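As a rough idea of what the directory extension might look like, here is a sketch. It is not the version I wrote for John; it assumes Python 3, assumes the saved pages end in .html, and uses a hypothetical function name, urls_in_directory. The test pages it builds are invented.

```python
import glob
import os
import re
import tempfile

URLS = re.compile('a class="yschttl" .*href=".*A//([^"]+)')

def urls_in_directory(folder):
    """Collect pattern matches from every .html file in `folder`."""
    found = []
    # sorted() keeps the files in a predictable order.
    for path in sorted(glob.glob(os.path.join(folder, "*.html"))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            found.extend(URLS.findall(handle.read()))
    return found

# Try it on a throwaway folder holding two invented pages.
with tempfile.TemporaryDirectory() as folder:
    for name, host in [("a.html", "www.ilovecheese.com"), ("b.html", "www.example.com")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write('<a class="yschttl" href="http://rds.example/*-http%3A//' + host + '">x</a>\n')
    all_urls = urls_in_directory(folder)

print(all_urls)
```

The only new idea is the outer loop over file names; each file is read and searched just as the single file was.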
Update: If you want to try it, you can right-click and save this zip, which contains the program above and a version that does the whole directory.