My power-lawed blog

Over the last several years, a number of researchers have written about the hyperlinked structure of the web and how it changes over time. It appears that the natural tendency of the web (and of many similar networks) is to link very heavily to a small number of sites; the web picks winners. Or, to be more accurate, the collective nature of our browsing picks winners. As users forage for information, they tend to follow paths that are, in the aggregate, predictable.

Huberman notes that not only are the surfing patterns of the web regular, the structure of the web itself exhibits a number of regularities, particularly in its distribution of features. The normal distribution found everywhere, the bell-shaped curve we are familiar with, is also found on the web, but for a number of features the web demonstrates a “power law” distribution. George Kingsley Zipf described a similar sort of power law distribution (Zipf’s Law) among words in the English language, showing that the most frequently used English word (“the”) appears far more often than the second-most frequently used word (“of”), which appears far more often than the third-ranked word, and so on. (Yes, yes, I know: Zipf deals in ranks and so it’s different, but not different enough to matter for this discussion.) This distribution, with magnitude inversely proportional to rank, has shown up in a number of places, from the size of earthquakes to city populations.
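For readers who want the rank caveat spelled out, here is a rough sketch of how the two forms relate. The symbols are my own shorthand rather than anything from Huberman or Zipf: f(r) is the frequency of the item at rank r, P(k) is the share of items with magnitude k, and s and γ are the two exponents. The conversion assumes both relations hold exactly, which real data only approximates.

```latex
% Zipf's rank-frequency form: magnitude falls off with rank
f(r) \propto r^{-s}, \qquad s \approx 1

% Power-law form: share of items having magnitude k
P(k) \propto k^{-\gamma}

% Under the idealized assumption that both hold exactly,
% the exponents are related by
\gamma = 1 + \frac{1}{s}
```

With s close to 1, this puts γ close to 2, which is why the two descriptions can be treated as near-equivalent here.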

The number of “backlinks,” hyperlinks leading to a given page on the web, provides an example of such a distribution. If the number of backlinks were distributed normally, we would expect a large number of sites to have an average number of backlinks, and a relatively small number of sites to have very many or very few. For example, if the average page on the web has 2.1 backlinks, we might expect that a very large number of pages have about two backlinks, and that a relatively small number have none or several. In practice, a very large number of pages have only a single backlink, a much smaller number have two backlinks, and a much smaller number again have three. The average is as high as 2.1 because of the small number of sites that attract many millions of backlinks each. Were human height distributed in a similar fashion, with an average height of, say, 2.1 meters, we would find most of the globe’s population stood under a meter tall, except for a handful of giants who looked down at us from thousands of kilometers in the sky.
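To make the gap between the "average" page and the typical page concrete, here is a minimal Python sketch. It assumes an idealized power law in which the share of pages with k backlinks is proportional to k to the power of minus 2.0, cut off at 100,000 backlinks. Both the exponent and the cutoff are illustrative assumptions, not measurements of the real web, so the numbers it prints will not match the 2.1 figure above.

```python
def power_law_shares(exponent, max_links):
    """Share of pages having k = 1..max_links backlinks, with each share
    proportional to k ** -exponent (an idealized, assumed distribution)."""
    weights = [k ** -exponent for k in range(1, max_links + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_links(shares):
    """Average number of backlinks per page."""
    return sum(k * share for k, share in enumerate(shares, start=1))


def median_links(shares):
    """Smallest backlink count k such that at least half of all pages
    have k or fewer backlinks."""
    cumulative = 0.0
    for k, share in enumerate(shares, start=1):
        cumulative += share
        if cumulative >= 0.5:
            return k
    return len(shares)


# Exponent and cutoff are assumptions chosen for illustration only.
shares = power_law_shares(exponent=2.0, max_links=100_000)

print(f"share of pages with a single backlink: {shares[0]:.2f}")
print(f"median backlinks: {median_links(shares)}")
print(f"mean backlinks:   {mean_links(shares):.2f}")
```

The median sits at the bottom of the range while the mean is pulled several times higher by the rare, heavily linked pages, which is the same effect as the handful of giants in the height example.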

Huberman notes that this distribution is “scale-free”; that is, the general nature of the distribution looks the same whether you are examining the entire World Wide Web, or just a small subset of pages. I have been blogging for several years, and each blog entry ends up on its own page, often called a “permalink.” I took a look at the last 1,500 of my posts, to see how many backlinks each one received. The first figure to the right shows a ranked distribution of incoming links, not including the first-ranked posting. The vast majority (1,372) of these 1,500 pages do not have any incoming links at all. Despite this, the average number of backlinks (“inlinks” in the figure) is 0.9, driven upward by the top-ranked posts. Incidentally, as the second graph shows, the number of comments on each of these entries follows a similar distribution, with a very large number of posts (882) receiving either a single comment or none at all. To make these figures more legible, I have omitted the most popular post, entitled “How to Cheat Good,” which was the target of 435 backlinks by August of 2007, and had collected 264 comments.

One way to explain such a distribution is to assume that there were a few pages at the beginning of the web, in the early 1990s, and that each year these sites have grown by a certain percentage. Since the number of pages created has increased each year, we would assume that these older sites have accumulated more links over time. Such an explanation is as unlikely on the web as it is among humans. We do not grow more popular with every year that passes; indeed, youth often garners more attention than age. There are pages that are established and quickly become “hits,” linked to from around the web. Although this does not explain their initial rise in popularity, many of these sites gain new backlinks precisely because they have already received a large number of them. Because of the structure of the web, and normal browsing patterns, highly linked pages are likely to attract ever more links, a characteristic Huberman refers to as “preferential attachment.”
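The rich-get-richer dynamic can be sketched as a toy simulation. This is a deliberately simplified model of my own, not Huberman's: pages arrive one at a time, each new page makes exactly one link, and it picks its target with probability proportional to the target's current backlinks plus one. The function name, the one-link-per-page rule, and the "plus one" baseline are all assumptions made for illustration. The model also says nothing about why a page first takes off, and because pages arrive in sequence it gives early pages a head start that the real web does not guarantee.

```python
import random


def simulate_links(num_pages, seed=0):
    """Toy preferential-attachment model (an illustrative assumption).

    Returns a list where entry i is the number of backlinks page i
    has collected once num_pages pages exist.
    """
    rng = random.Random(seed)
    backlinks = [0]   # page 0 starts with no backlinks
    # Page i appears (backlinks[i] + 1) times in this list, so a uniform
    # pick from it favors pages in proportion to backlinks + 1.
    targets = [0]
    for new_page in range(1, num_pages):
        chosen = rng.choice(targets)
        backlinks[chosen] += 1
        targets.append(chosen)      # the chosen page just got more attractive
        backlinks.append(0)
        targets.append(new_page)    # baseline entry for the newcomer
    return backlinks


links = simulate_links(10_000, seed=0)
ranked = sorted(links, reverse=True)

print("top five pages by backlinks:", ranked[:5])
print("median backlinks:", ranked[len(ranked) // 2])
print("pages with no backlinks:", links.count(0))
```

Even in this stripped-down version, most pages end up with no backlinks at all while a few collect a disproportionate share, much like the figures for my own posts.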

Take, for example, my most popular recent posting. The earliest comments and links came from friends and others who might regularly browse my blog. Some of those people linked to the site in their own blogs. Eventually, it came to the attention of several widely read and popular blogs, including Michael Froomkin’s “Discourse.net” and Bruce Schneier’s “Schneier on Security.” Someone noticed it on the latter blog, and a link was posted to it from “Boing Boing,” a very popular site with millions of readers. Naturally, many people saw it on Boing Boing and linked to it as well, from their blogs and gradually from other web sites. Eventually, I received emails telling me that the page had been cited in a European newspaper, and that a printed version of the posting had been distributed to a university department’s faculty.

It is impossible for me or anyone else to guess why this particular posting became especially popular, but every page on the web that becomes popular owes that, at least in part, to the popularity it has already gained. The exact mechanism is unclear, but after some level of success, it appears that popularity in networked environments becomes “catching.” The language of epidemiology is intentional. Just as social networks transmit diseases, they can also transmit ideas, and the structures that support that distribution seem to be in many ways homologous.


Not-so anonymous edits

Just a quick link in case people don’t follow Wired News these days. They have an article about Wikipedia Scanner (down, at the moment, due to the overload), a mash-up of Wikipedia data and a reverse DNS lookup that shows where anonymous editors are posting from. Looking up suspicious IPs isn’t hard to do, but indexing those anonymous edits is very interesting, since you can track down all the anonymous edits that come from a particular IP range (e.g., the Republican Party headquarters or the CIA).
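The indexing idea can be sketched in a few lines of Python using the standard `ipaddress` module. This is not how Wikipedia Scanner itself is built, which I have not seen; it is a minimal illustration that swaps the live reverse DNS lookup for a static table of address ranges. The organization names, the ranges (reserved documentation addresses), and the edit log are all made up.

```python
import ipaddress
from collections import defaultdict

# Hypothetical owners and ranges; these are reserved documentation
# addresses, not real organizations.
KNOWN_RANGES = {
    "Example Org A": ipaddress.ip_network("192.0.2.0/24"),
    "Example Org B": ipaddress.ip_network("198.51.100.0/24"),
}

# Hypothetical log of anonymous edits: (editor IP, article title).
ANONYMOUS_EDITS = [
    ("192.0.2.17", "Article one"),
    ("198.51.100.5", "Article two"),
    ("192.0.2.200", "Article three"),
    ("203.0.113.9", "Article four"),
]


def index_edits_by_owner(edits, ranges):
    """Group edited article titles by the owner of the editor's IP range."""
    index = defaultdict(list)
    for ip_text, article in edits:
        address = ipaddress.ip_address(ip_text)
        # First range containing the address, or "unknown" if none match.
        owner = next(
            (name for name, network in ranges.items() if address in network),
            "unknown",
        )
        index[owner].append(article)
    return dict(index)


index = index_edits_by_owner(ANONYMOUS_EDITS, KNOWN_RANGES)
for owner, articles in index.items():
    print(owner, "->", articles)
```

Once the edits are grouped this way, pulling up everything edited from one organization's range is a single dictionary lookup, which is what makes the index more interesting than checking one IP at a time.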

Of course, those people making anonymous edits with their IPs showing either (a) were not trying especially hard to hide their manipulations or (b) didn’t know any better. I am guessing that in most cases it is the former. Keeping your identity quiet on Wikipedia is fairly easy. First, anyone can have an account, and accounts (I believe) don’t show the IP. But that’s a superficial layer of protection. You can always make edits from an internet cafe, or through Tor, or both!

Wikipedia Scanner needs either to backstop its servers to handle the flash crowds, or to provide the database as a torrent. I suspect they can’t do the latter because they didn’t use open-source GeoIP data (doh!).


Pool on the Net

One of my favorites:

If media become “demassified” to serve individual wants, it will not be by throwing on lazy readers the arduous task of searching vast information bases, but by programming computers heuristically to give particular readers more of what they chose last time. Computer-aided instructional programs similarly assess students’ past performance before providing the instruction they need. The lines between publication and conversation vanish in this sort of system. Socrates’ concern that writing would warp the flow of intelligence can at last be set to rest. Writing can become dialogue.

Ithiel de Sola Pool, Technologies of Freedom, Cambridge, Mass: Belknap Press, 1983, pp. 230-1.

He goes on to suggest that electronic media tend toward freedom, but we have to craft policy to keep them that way.


Two great movies, for very different reasons

I have been trying, desperately, to write over the last few weeks, but have gotten out to see two movies: Bourne Ultimatum and Rocket Science. I don’t have time to do full reviews of them, but since I enjoyed both a lot, I’ll do hundred-word versions.

Bourne Ultimatum: With so many action movies lately, it seems like it would be hard to do, for example, a car chase that manages to be even marginally believable and hasn’t been done before, but the film manages it. It also manages, as with the earlier movies in the trilogy, to do fights that are reasonably realistic. Hollywood (cf. Casino Royale) has finally figured out that people who have been trained to fight do things other than throw punches. The pace is exhilarating, and as long as you don’t expect too much in the way of character development, and don’t get too motion sick from the moving camera, this is a must-see.

Rocket Science: You won’t know if you’ll like this until you see the trailer, but, unfortunately, the trailer gives away some of the best moments in the film. I hate it when they do that! The rest of the movie is also excellent, and it holds everything together well, but if you are smart, you will take my advice and just go and see it. Chances are, you already know if you like this kind of quirky, indie flick (Dollhouse, Life Aquatic, Little Miss Sunshine), and if you do, you should make your way out to see this.

In other words, after a somewhat lackluster year, these are both very nice summer films.
