This Saturday morning, I will give a short presentation to a working group of the International Internet Preservation Consortium (IIPC) at the British Library on the role of weblogs in archiving (and archiving weblogs). The consortium consists of several national libraries, and our working group has the unenviable task of trying to determine what future historians will want to examine from today’s web. What do we save and why? The effort here is less to answer than to ask questions, and I am posting this outline in the hope readers may have ideas.
A recent conference helped to convince me that archiving the web will become much more difficult over the coming decade. This is particularly true because of the shift from a textual web to databases, images, and other mixed media. Tracking changes in a scientific database, for example, would likely be of importance to future researchers, but presents special challenges. On the flight back, I read the following in an article in the in-flight magazine, from a compiler of an online cuneiform lexicon:
Because it is electronic and will be continuously updated, it will be a living, eternally morphing resource that will never go stale. It will also make the teaching of Sumerian much easier.
Such a “morphing” resource makes the job of the archivist even more important. While regular publications might go through clear editions, and older editions may be retained, here the accumulation of knowledge looks more like a blackboard than a stone tablet. They still turn up such tablets in the desert, representing such things as a sale receipt for a goat purchase, and they remain intact because they are innately difficult to destroy. The content of the web, on the other hand, requires an active effort to copy and store, else it evaporates almost instantly.
The natural response from many of us who are not part of the library tradition is “archive it all.” Even if today’s technologies make that a difficult prospect, proponents suggest, Moore’s Law is even more in evidence in the development of storage technology than it is in processor advancement. Consider how quickly the cost of a megabyte of storage has dropped over the last half-decade. Soon, we will be able to carry the Library of Congress in our wallets, and when that happens, archiving the Internet is sure to follow, at least if you believe the person who has the most experience with this.
The problem is that the content of the internet grows to fill whatever local storage and bandwidth become available. While the number of hosts may not be growing exponentially, and Google has presumably cached the 3 billion pages it indexes, these sites are increasingly capable of holding more and more data. Archivists are likely to have to play catch-up perpetually. Moreover, even with contemporaneous web content, it is difficult to know what we have and where to find it.
The question remains: what are the major challenges to identifying materials of future interest, and how do we meet those challenges? More especially, what (if any) special challenges and opportunities do weblogs provide to the archival process? Weblogs represent the vanguard of the web, and they hint at the kinds of problems we are likely to encounter in archiving the rest of the web in the future. So understanding the difficulties and prospects found in archiving blogs can help inform the larger process.
Many still consider blogging a fad. It would seem, from the perspective and history of similar technologies, genuinely unlikely that it would become a significant part of how people interact with the world. Of course, the same might have been said of email, a technology that is now so ubiquitous that, with the exception of spam, its effect goes largely unnoticed. I suspect, in fact, that blogs are currently at the peak of public attention, and there will be a watershed period during which the number of blogs and bloggers plummets, followed by steady growth of blog-related technologies in the middle and long term. (That is, it will follow something like the Gartner Group’s Hype Cycle.) Of course, it’s not easy to predict the future of technology, but it seems likely that some form of distributed publishing will continue to take place through this century.
Even if blogging is more than a fad, many will suggest that blog content is ephemeral for a reason: that it is not worth archiving. There are three reasons to question this. First, some portion of weblog content is undeniably worthy of attention by future historians. At present, facts regarding the authenticity of documents “uncovered” in a broadcast report are being heavily discussed on weblogs. This presentation itself, I would hope, would be worthy of future consideration. Likewise, weblogs are increasingly the public journals of people in all walks of life. The fact that these weblogs are unedited means that much of their content is not of interest to the casual reader. It is much more difficult to know whether a particular item will be of future interest, prompting an effort one news outlet called “saving Shakespeare’s blog”. Finally, social historians have largely had to “make do” with whatever has been saved, but it is wrong to assume that this indicates their preferred source. Indeed, many historians are most interested in the day-to-day life of an ordinary person, and weblogs may provide just such a window into ordinary lives.
At present, the sheer scale of the archival task is daunting. One of the defining characteristics of blogs is that they are updated frequently, in many cases daily. There are mechanisms that would make keeping an archive up to date easier (e.g., update pinging), but they are not employed universally. Identifying new blogs, and recognizing which blogs are no longer updated, can also be challenging. There have been efforts at taking a kind of moving snapshot of most of the blogs on the web (e.g., the NITLE Blogcensus), and several services regularly process the content of updated blogs in order to extract useful information (e.g., Technorati). However, outside of the Internet Archive, none of these is attempting to preserve continuously the full state of these weblogs. There are simply too many, changing too quickly. As more and more people keep blogs, and keep multiple blogs, we will likely find that the increase in the number of blogs (as well as churn) continues to be a difficult problem.
Audio and video blogs, or textual blogs that contain these elements, represent a very small part of the whole, but these are likely to become more common as more people make use of camera phones and other mobile devices. At present, multimedia itself is often a size and bandwidth bottleneck on the web. The most widely viewed multimedia files have begun to be archived thanks to the Internet Archive’s freecache service, but less widely viewed multimedia is generally not archived. As we move beyond even Bush’s memex, and researchers begin to keep a continuous lifelog, we will be faced with the question of how to make sure local recordings are somehow distributed and archived in a useful way.
There is a significant amount of redundancy found in entries on weblogs. Despite being highly personal, many sites comment on news reports and add only personal opinion or analysis. An overly extreme analogy would ask whether keeping individual votes in an election would be worthwhile, or if the result is all that is important. Postings that merely reiterate or amplify information found elsewhere may be seen as less important to archive. Unfortunately, even information that is unique is often highly intertextual. Because so much of what is written on weblogs concerns current events, and because it is part of a continuing narrative, there are both temporal links to earlier changes, and hyperlinks to a broader web. Paradoxically, while many blog entries may seem redundant, most also rely heavily on context to be properly interpreted.
Finally, despite the fact that most blogs are produced by a fairly limited set of content management systems, there is only partial consistency in the structure of information in a weblog. There have been some early efforts to mine that structure in order to provide some kind of semantic tagging for the content, but the flexibility afforded to the users of most blogging systems makes it difficult to be able to consistently extract such information.
The success of weblogs over the last several years provides us with some suggestions for how to exploit the structure and nature of blogging to help inform the process of archiving. Weblogs present some structural conditions that can potentially make the archival process more effective, but perhaps more importantly, it may be possible to leverage the collective activities of bloggers to aid in the archival process.
There has been a great deal of discussion over whether the semantic web is more than just hyperbole, and especially whether it is as inevitable as early promoters suggested. The blogging community has become a kind of tinkerer’s testbed of semantic data, in which necessity has led to rough design and collective refinement. Nowhere is this as obvious as in the case of the development of RSS. For the search engine developer, and for similar reasons for the archivist, RSS represents a prime resource to be exploited. Changes to a site are presented in an easily accessible format, and data regarding their source and authorship are presented in a form easily parsed and archived. Other forms of metadata, from Dublin Core meta tags to Creative Commons licenses in RDF format, are also more frequently found on weblogs than on other websites, at least for the time being. Public RSS aggregators like Bloglines already archive the output of a large part of the web, and provide an additional approximation of the readership of various weblogs.
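To make the point about RSS concrete, here is a minimal sketch of how an archiver might pull entries and their source and authorship data out of a feed. It assumes an RSS 2.0 feed that uses the Dublin Core `dc:creator` element for authorship; the sample feed, the example.org addresses, and the function name `extract_entries` are all made up for illustration, and real feeds vary a good deal more than this.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed standing in for a real weblog's feed.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.org/</link>
    <item>
      <title>First post</title>
      <link>http://example.org/2004/09/first-post</link>
      <pubDate>Mon, 13 Sep 2004 09:00:00 GMT</pubDate>
      <dc:creator>A. Blogger</dc:creator>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2004/09/second-post</link>
      <pubDate>Tue, 14 Sep 2004 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# Namespace prefix for Dublin Core elements, in ElementTree's notation.
DC = "{http://purl.org/dc/elements/1.1/}"


def extract_entries(feed_xml):
    """Return one dict per feed item, with source and authorship metadata."""
    channel = ET.fromstring(feed_xml).find("channel")
    source = channel.findtext("title", default="")
    entries = []
    for item in channel.findall("item"):
        entries.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            # Many feeds omit an author; record an empty string in that case.
            "author": item.findtext(DC + "creator", default=""),
        })
    return entries


if __name__ == "__main__":
    for entry in extract_entries(SAMPLE_FEED):
        print(entry["published"], entry["link"], entry["author"])
```

An archiver polling feeds this way would still need to fetch the linked pages themselves, since many feeds carry only excerpts.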
In almost all cases, weblogs produce their own self-archive. That archive is particularly tenuous. The author or authors of a particular weblog are normally able to change the archived content without a clear indication that such a change has taken place. More importantly, however, the archive only lasts as long as the initial author keeps the weblog. Many weblogs are short-lived, and in any event, we can assume that all weblogs are likely to be kept in operation for a finite amount of time. These local archives need to be duplicated elsewhere. At present there is nothing as simple as RSS that allows for these archives to be duplicated. However, similarities in archive addressing patterns make it possible to extract these archives in most cases. This means that an archiving effort need not keep up with the quick pace of changes on many weblogs, and instead can rely on archives for these sites. If a standardized way of accessing these archives were to evolve within the blogging world, it is possible that it would be adopted by other frequently updated websites.
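As a rough illustration of what exploiting archive addressing patterns might look like, the sketch below enumerates monthly archive addresses from a URL template and keeps a fingerprint of each fetched page so that a later, silent edit can be noticed. The template, the example.org address, and the function names are assumptions for the sake of the example; each blogging system has its own scheme, and the fetching itself is left out.

```python
import hashlib
from datetime import date


def monthly_archive_urls(pattern, first, last):
    """List monthly archive URLs from `first` to `last`, inclusive.

    `pattern` is a hypothetical template with {year} and {month} fields.
    """
    urls = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        urls.append(pattern.format(year=year, month=month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


def fingerprint(page_bytes):
    """Return a SHA-256 hex digest of a page's raw bytes."""
    return hashlib.sha256(page_bytes).hexdigest()


def changed_since(stored_fingerprint, page_bytes):
    """True if the page no longer matches the fingerprint we stored."""
    return fingerprint(page_bytes) != stored_fingerprint


if __name__ == "__main__":
    template = "http://example.org/archives/{year:04d}_{month:02d}.html"
    for url in monthly_archive_urls(template, date(2003, 11, 1), date(2004, 2, 1)):
        print(url)
```

A byte-level fingerprint will also flag trivial changes such as a redesigned sidebar, so a real archiver would probably want to compare only the entry text.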
The fate of centralized libraries in the distant and recent past should guide us toward ways of distributing archives. Automatically accessible self-archiving is one way to do this, but there are at least two others. A project analogous to (or based on) the Freenet project might be another. A distributed archive that made use of some part of a large group of users’ hard drives, along with error correction and encryption, would provide an extraordinarily resilient archive of the web. Users could select data that should be archived (perhaps automatically by mining the local browser cache), and contribute resources toward the storage of the archive. In many ways such a peer-to-peer archive is a very promising direction for archiving over the long term.
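The error-correction idea can be sketched in a few lines. The example below is not how Freenet works; it is a deliberately simple stand-in that splits a document into pieces, adds one XOR parity piece, and rebuilds any single piece that goes missing (say, because one contributor's hard drive disappears). Encryption is left out entirely, the function names are invented for the example, and a real system would likely use a stronger erasure code such as Reed-Solomon so that more than one loss could be survived.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data, k):
    """Split `data` into k equal shards (zero-padded) plus one parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)
    for shard in shards:
        parity = xor_bytes(parity, shard)
    return shards + [parity]


def recover(shards):
    """Rebuild the single missing shard (marked as None) from the others."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) != 1:
        raise ValueError("simple parity can rebuild exactly one missing shard")
    present = [shard for shard in shards if shard is not None]
    rebuilt = bytes(len(present[0]))
    for shard in present:
        rebuilt = xor_bytes(rebuilt, shard)
    restored = list(shards)
    restored[missing[0]] = rebuilt
    return restored


def join(shards, original_len):
    """Reassemble the document from the data shards, dropping the padding."""
    return b"".join(shards[:-1])[:original_len]


if __name__ == "__main__":
    page = b"archive this page"
    pieces = split_with_parity(page, 3)
    pieces[1] = None  # one peer has gone offline
    print(join(recover(pieces), len(page)))
```

The cost of this scheme is one extra shard of storage for every k data shards, which is the kind of trade-off a peer-to-peer archive would have to tune.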
In the near term, users may be exploited to identify information to be stored. Presumably, this resource identification feature is one of the functions of the Alexa toolbar. Bloggers routinely annotate the wider web, identifying resources that are of interest and indicating why they might be valuable. This ongoing review of material provides a way for archivists to locate material that is of particular popular interest at any given time. Indeed, services like Blogdex or Daypop would provide such a list of “sites of interest.” Alternatively, del.icio.us provides an ongoing collection of such links, and Furl takes this a step further, archiving pages on a user’s request. Another alternative is an extended form of self-archiving, which locally caches copies of pages linked from a weblog. Although there has been interest expressed in such a system, I am not aware of one being used at present.
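A list of “sites of interest” of this kind could be approximated with very little machinery. The sketch below is my own guess at the general approach, not a description of how Blogdex or Daypop actually work: it collects the outbound links from each blog's posts and ranks addresses by how many distinct blogs link to them. The blog names, addresses, and function names are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) link targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.add(value)


def outbound_links(html):
    """Return the set of link targets found in one post's HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def sites_of_interest(posts_by_blog, min_blogs=2):
    """Rank URLs by the number of distinct blogs that link to them."""
    counts = Counter()
    for posts in posts_by_blog.values():
        linked = set()
        for html in posts:
            linked |= outbound_links(html)
        # Each blog counts once per URL, however often it repeats the link.
        counts.update(linked)
    return [(url, n) for url, n in counts.most_common() if n >= min_blogs]


if __name__ == "__main__":
    posts = {
        "blog-a": ['<p>See <a href="http://example.com/report">this</a>.</p>'],
        "blog-b": ['<a href="http://example.com/report">report</a> and '
                   '<a href="http://example.net/other">another</a>'],
    }
    for url, n in sites_of_interest(posts):
        print(n, url)
```

Counting distinct blogs rather than raw links keeps one prolific author from dominating the list, though it does nothing about groups of blogs that simply echo one another.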
Given that such collective approaches to archiving are already emerging organically from the blogging community at large, and the advantages of distributed archives over localized copies, it seems to me that such approaches should be investigated and encouraged, and that a marriage between distributed efforts and more centralized repositories might be explored.