Masters of PageRank

bOing bOing has a note indicating that entering “http” into Google returns the top PageRanked pages on the Web. I see no reason for this not to be the case. They provide the top ten; I extracted the top 1000. Download: google1k.txt.

What use is it? Well, if you can get links from these pages, your PageRank will probably soar. If you were looking for “important pages on the Web” as a sample for some sort of analysis of Web content, this list might be a worthwhile one. (Though I suspect a few of those entries will become dated quickly, e.g., the Johannesburg summit site.)


5 Comments

  1. Posted 11/13/2002 at 4:33 pm | Permalink

    i scooped your text file and made the http 1000. but hey – there’s only 950! how did you make that list?

  2. Posted 11/13/2002 at 5:31 pm | Permalink

    Drat! I don’t know how that happened. I’ll take a look. I crawled the first 10 result pages of 100 links each (all it would give me) rather than using the API, then deleted the links that pointed to Google or had only an IP address. I suspect that some of the IP addresses may have been real sites (I was assuming they were Google cache). Maybe I’ll try to clean it up, though in retrospect I’m not sure whether it’s worth it :).

  3. Posted 11/13/2002 at 5:36 pm | Permalink

    So, it turns out that the last page only has 58 links, not 100, which makes 958 in all. I guess the other 8 are likely to be those that were IP addresses (or something). I take it back: for an error of 8, it’s prolly not worth redoing.

  4. Posted 11/13/2002 at 5:41 pm | Permalink

    you’re right – prolly not worth redoing. although it might be interesting to do it again in a month or two and see the changes. let me know if you do another crawl.

  5. Posted 11/13/2002 at 7:39 pm | Permalink

    Will do.
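
The cleanup step described in comment 2 (drop links to Google, drop links whose host is only an IP address) could be sketched roughly as below. This is not the script actually used for google1k.txt, which is not shown anywhere in the post; the function names, the choice to match only google.com and its subdomains, and the sample URLs are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(host: str) -> bool:
    """Return True if host is a bare IPv4/IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def keep_url(url: str) -> bool:
    """Hypothetical filter: reject Google links and IP-only hosts."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False  # no usable host, nothing to keep
    # Assumption: "links to google" means google.com and its subdomains.
    if host == "google.com" or host.endswith(".google.com"):
        return False
    return not is_ip_host(host)


def filter_results(urls):
    """Return the URLs that pass keep_url, preserving their order."""
    return [url for url in urls if keep_url(url)]


# Made-up sample input, not taken from the real crawl.
sample = [
    "http://www.google.com/about.html",
    "http://216.239.51.100/search?q=cache:example",
    "http://www.w3.org/",
    "http://www.yahoo.com/",
]
kept = filter_results(sample)
```

As comment 2 suspects, a filter like this throws away any real site that happened to be listed by IP address, which would account for a small shortfall in the final count.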
