Blogs and Archiving

This Saturday morning, I will give a short presentation to a working group of the International Internet Preservation Consortium (IIPC) at the British Library on the role of weblogs in archiving (and archiving weblogs). The consortium consists of several national libraries, and our working group has the unenviable task of trying to determine what future historians will want to examine from today’s web. What do we save and why? The effort here is less to answer than to ask questions, and I am posting this outline in the hope readers may have ideas.

A recent conference helped to convince me that archiving the web will become much more difficult over the coming decade. This is particularly true because of the shift from a textual web to databases, images, and other mixed media. Tracking changes in a scientific database, for example, would likely be of importance to future researchers, but presents special challenges. On the flight back, I read the following in an article in the in-flight magazine, from a compiler of an online cuneiform lexicon:

Because it is electronic and will be continuously updated, it will be a living, eternally morphing resource that will never go stale. It will also make the teaching of Sumerian much easier.

Such a “morphing” resource makes the job of the archivist even more important. While regular publications might go through clear editions, and older editions may be retained, here the accumulation of knowledge looks more like a blackboard than a stone tablet. They still turn up such tablets in the desert, representing such things as a sale receipt for a goat purchase, and they remain intact because they are innately difficult to destroy. The content of the web, on the other hand, requires an active effort to copy and store, else it evaporates almost instantly.

The natural response from many of us who are not part of the library tradition is “archive it all.” Even if today’s technologies make that a difficult prospect, proponents suggest, Moore’s Law is even more in evidence in the development of storage technology than it is in processor advancement. Consider how quickly the cost of a megabyte of storage has dropped over the last half-decade. Soon, we will be able to carry the Library of Congress in our wallets, and when that happens, archiving the Internet is sure to follow, at least if you believe the person who has the most experience with this.

The problem is that the content of the internet grows to encompass whatever local storage and bandwidth become available. While the number of hosts may not be growing exponentially, and Google has presumably cached the 3 billion pages it indexes, these sites are increasingly capable of holding more and more data. Archivists are likely to be playing catch-up perpetually. Moreover, even with contemporaneous web content, it is difficult to know what we have and where to find it.

The question remains: what are the major challenges to identifying materials of future interest, and how do we meet those challenges? More especially, what (if any) special challenges and opportunities do weblogs provide to the archival process? Weblogs represent the vanguard of the web, and they hint at the kinds of problems we are likely to encounter in archiving the rest of the web in the future. So understanding the difficulties and prospects found in archiving blogs can help inform the larger process.

The Challenges

Many still consider blogging a fad. It would seem, from the perspective and history of similar technologies, genuinely unlikely that it would become a significant part of how people interact with the world. Of course, the same might have been said of email, a technology that is now so ubiquitous that, with the exception of spam, its effect goes largely unnoticed. I suspect, in fact, that blogs are currently at the peak of public attention, and there will be a watershed period during which the number of blogs and bloggers plummets, followed by steady growth of blog-related technologies in the middle and long term. (That is, it will follow something like the Gartner Group’s Hype Cycle.) Of course, it’s not easy to predict the future of technology, but it seems likely that some form of distributed publishing will continue to take place through this century.

Even if blogging is more than a fad, many will suggest that blog content is ephemeral for a reason: that it is not worth archiving. There are three reasons to question this. First, some portion of weblog content is undeniably worthy of attention by future historians. At present, facts regarding the authenticity of documents “uncovered” in a broadcast report are being heavily discussed on weblogs. This presentation itself, I would hope, would be worthy of future consideration. Likewise, weblogs are increasingly the public journals of people in all walks of life. Second, because these weblogs are unedited, much of their content is not of interest to the casual reader, but it is much more difficult to know whether a particular item will be of future interest, prompting an effort one news outlet called “saving Shakespeare’s blog.” Finally, social historians have largely had to “make do” with whatever has been saved, but it is wrong to assume that this indicates their preferred source. Indeed, many historians are most interested in the day-to-day life of an ordinary person, and weblogs may provide just such a window into ordinary lives.

At present, the sheer scale of the archival task is daunting. One of the defining characteristics of blogs is that they are updated frequently, in many cases daily. There are mechanisms that would make keeping an archive up to date easier (e.g., update pinging), but they are not employed universally. Identifying new blogs, and recognizing which blogs are no longer updated, can also be challenging. There have been efforts at taking a kind of moving snapshot of most of the blogs on the web (e.g., the NITLE Blogcensus), and several services regularly process the content of updated blogs in order to extract useful information (e.g., Technorati). However, outside of the Internet Archive, none of these is attempting to preserve continuously the full state of these weblogs. There are simply too many, changing too quickly. As more and more people keep blogs, and keep multiple blogs, the growing number of blogs (as well as churn) will likely remain a difficult problem.

Audio and video blogs, or textual blogs that contain these elements, represent a very small part of the whole, but these are likely to become more common as more people make use of camera phones and other mobile devices. At present, multimedia itself is often a size and bandwidth bottleneck on the web. The most widely viewed multimedia files have begun to be archived thanks to the Internet Archive’s freecache service, but less widely viewed multimedia is generally not archived. As we move beyond even Bush’s memex, and researchers begin to keep a continuous lifelog, we will be faced with the question of how to make sure local recordings are somehow distributed and archived in a useful way.

There is a significant amount of redundancy found in entries on weblogs. Despite being highly personal, many sites comment on news reports and add only personal opinion or analysis. An overly extreme analogy would ask whether keeping individual votes in an election would be worthwhile, or if the result is all that is important. Postings that merely reiterate or amplify information found elsewhere may be seen as less important to archive. Unfortunately, even information that is unique is often highly intertextual. Because so much of what is written on weblogs concerns current events, and because it is part of a continuing narrative, there are both temporal links to earlier changes, and hyperlinks to a broader web. Paradoxically, while many blog entries may seem redundant, most also rely heavily on context to be properly interpreted.

Finally, despite the fact that most blogs are produced by a fairly limited set of content management systems, there is only partial consistency in the structure of information in a weblog. There have been some early efforts to mine that structure in order to provide some kind of semantic tagging for the content, but the flexibility afforded to the users of most blogging systems makes it difficult to extract such information consistently.

The Opportunities

The success of weblogs over the last several years provides us with some suggestions for how to exploit the structure and nature of blogging to help inform the process of archiving. Weblogs present some structural conditions that can potentially make the archival process more effective, but perhaps more importantly, it may be possible to leverage the collective activities of bloggers to aid in the archival process.

There has been a great deal of discussion over whether the semantic web is more than just hyperbole, and especially whether it is as inevitable as early promoters suggested. The blogging community has become a kind of tinkerer’s testbed of semantic data, in which necessity has led to rough design and collective refinement. Nowhere is this as obvious as in the case of the development of RSS. For the search engine developer, and for similar reasons for the archivist, RSS represents a prime resource to be exploited. Changes to a site are presented in an easily accessible format, and data regarding their source and authorship are presented in a form easily parsed and archived. Other forms of metadata, from Dublin Core meta tags to Creative Commons licenses in RDF format, are also more frequently found on weblogs than on other websites, at least for the time being. Public RSS aggregators like Bloglines already archive the output of a large part of the web, and provide an additional approximation of the readership of various weblogs.
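
To make the point about RSS concrete, here is a minimal sketch in Python of how an archiver might pull source and authorship data out of a feed. The feed is an invented RSS 2.0 sample (the titles, URLs, and author are hypothetical), and `extract_entries` is an illustrative name, not part of any existing archiving tool. Real feeds vary considerably between versions of RSS, so this should be read as one possible approach under those assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.org/</link>
    <item>
      <title>Blogs and Archiving</title>
      <link>http://example.org/archives/2004/09/blogs-and-archiving</link>
      <pubDate>Thu, 16 Sep 2004 12:00:00 GMT</pubDate>
      <dc:creator>A. Blogger</dc:creator>
    </item>
    <item>
      <title>A second post</title>
      <link>http://example.org/archives/2004/09/second-post</link>
      <pubDate>Fri, 17 Sep 2004 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# Namespace prefix for Dublin Core elements such as <dc:creator>.
DC = "{http://purl.org/dc/elements/1.1/}"


def extract_entries(feed_xml):
    """Return one dict per <item>, keeping the fields an archive might index."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        entries.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # Prefer <dc:creator>, fall back to <author>; either may be absent.
            "author": item.findtext(DC + "creator") or item.findtext("author"),
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return entries


for entry in extract_entries(SAMPLE_FEED):
    print(entry["published"], entry["link"])
```

An archiver polling many feeds could store these records alongside the fetched pages, so that each saved copy carries its own date and attribution.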

In almost all cases, weblogs produce their own self-archive. That archive is particularly tenuous. The author or authors of a particular weblog are normally able to change the archived content without a clear indication that such a change has taken place. More importantly, however, the archive only lasts as long as the initial author keeps the weblog. Many weblogs are short-lived, and in any event, we can assume that all weblogs are likely to be kept in operation for a finite amount of time. These local archives need to be duplicated elsewhere. At present there is nothing as simple as RSS that allows for these archives to be duplicated. However, similarity in archive addressing patterns allows these archives to be extracted in most cases. This means that an archiving effort need not keep up with the quick pace of changes on many weblogs, and instead can rely on archives for these sites. If a standardized way of accessing these archives were to evolve within the blogging world, it is possible that it would be adopted by other frequently updated websites.
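
As a rough illustration of exploiting archive addressing patterns, the Python sketch below enumerates the monthly archive pages a crawler might request from a single weblog. The `/archives/YYYY/MM/` layout is an assumption made for the example; actual patterns differ between blogging systems, which is why the pattern is a parameter. The function name `monthly_archive_urls` is likewise hypothetical.

```python
from datetime import date


def monthly_archive_urls(base_url, first, last,
                         pattern="{base}/archives/{year:04d}/{month:02d}/"):
    """List one archive URL per month from `first` to `last`, inclusive.

    The default pattern is an assumed layout, not a standard.
    """
    urls = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        urls.append(pattern.format(base=base_url.rstrip("/"),
                                   year=year, month=month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


# Example: a hypothetical blog that ran from November 2003 to February 2004.
for url in monthly_archive_urls("http://example.org/",
                                date(2003, 11, 5), date(2004, 2, 1)):
    print(url)
```

Because the pages are fetched from the weblog's own archive, a crawler working this way can visit a site once rather than tracking every update as it happens.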

The fate of centralized libraries in the distant and recent past should guide us toward ways of distributing archives. Automatically accessible self-archiving is one way to do this, but there are at least two others. A project analogous to (or based on) the Freenet project might be another. A distributed archive that made use of some part of a large group of users’ hard drives, along with error correction and encryption, would provide an extraordinarily resilient archive of the web. Users could select data that should be archived (perhaps automatically by mining the local browser cache), and contribute resources toward the storage of the archive. In many ways such a peer-to-peer archive is a very promising direction for archiving over the long term.
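
To give a sense of what error correction across many users' hard drives could look like, here is a deliberately simple Python sketch. It splits a page into fixed-size chunks, computes a single XOR parity chunk, and rebuilds one lost chunk from the rest. This is not how Freenet works, and a real system would likely use a stronger erasure code and add the encryption mentioned above; the function names and the sample page are invented for the example, and the page is assumed to be non-empty.

```python
import hashlib


def split_chunks(data, size):
    """Split data into fixed-size chunks, zero-padding the last one."""
    return [data[i:i + size].ljust(size, b"\0")
            for i in range(0, len(data), size)]


def xor_parity(chunks):
    """XOR equal-length chunks together, byte by byte."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(chunks, parity):
    """Rebuild the single chunk marked None from the survivors and the parity."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    return missing, xor_parity(survivors + [parity])


page = b"<html>an archived weblog entry</html>"
chunks = split_chunks(page, 8)          # each chunk could live on a different peer
digests = [hashlib.sha256(c).hexdigest() for c in chunks]
parity = xor_parity(chunks)             # stored on yet another peer

# Simulate one peer dropping out of the network.
damaged = list(chunks)
damaged[2] = None
index, rebuilt = recover_missing(damaged, parity)

# The stored digest confirms that the rebuilt chunk matches the original.
assert hashlib.sha256(rebuilt).hexdigest() == digests[index]

# Reassemble and trim the padding; the original length must be stored too.
restored = b"".join(rebuilt if c is None else c for c in damaged)[:len(page)]
print(restored == page)
```

A single parity chunk only survives the loss of one peer at a time, so this illustrates the principle rather than the resilience a large peer-to-peer archive would need.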

In the near term, users may be exploited to identify information to be stored. Presumably, this resource identification feature is one of the functions of the Alexa toolbar. Bloggers routinely annotate the wider web, identifying resources that are of interest and indicating why they might be valuable. This ongoing review of material provides a way for archivists to locate material that is of particular popular interest at any given time. Indeed, services like Blogdex or Daypop would provide such a list of “sites of interest.” Alternatively, provides an ongoing collection of such links, and Furl takes this a step further, archiving pages on a user’s request. Another alternative is an extended form of self-archiving, which locally caches copies of pages linked from a weblog. Although there has been interest expressed in such a system, I am not aware of one being used at present.
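
The idea of using bloggers' links to find material worth saving can be sketched in a few lines of Python. The example below counts how many distinct weblogs link to each URL and keeps those cited by at least two. It is not a description of how Blogdex or Daypop actually rank pages; the blog addresses, the links, and the `min_blogs` threshold are all invented for illustration.

```python
from collections import Counter


def rank_linked_urls(links_by_blog, min_blogs=2):
    """Rank URLs by the number of distinct blogs that link to them."""
    counts = Counter()
    for urls in links_by_blog.values():
        # A blog that links to the same page repeatedly still counts once.
        for url in set(urls):
            counts[url] += 1
    ranked = [(url, n) for url, n in counts.items() if n >= min_blogs]
    # Most-cited first; ties are broken alphabetically for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical outbound links harvested from three weblogs.
links_by_blog = {
    "http://blog-a.example/": ["http://news.example/story",
                               "http://news.example/story",
                               "http://a.example/only"],
    "http://blog-b.example/": ["http://news.example/story",
                               "http://lexicon.example/"],
    "http://blog-c.example/": ["http://lexicon.example/",
                               "http://news.example/story"],
}

for url, blog_count in rank_linked_urls(links_by_blog):
    print(blog_count, url)
```

The resulting list could serve as a queue of pages for an archive to fetch while they are still attracting attention.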

Given that such collective approaches to archiving are already emerging organically from the blogging community at large, and the advantages of distributed archives over localized copies, it seems to me that such approaches should be investigated and encouraged, and that a marriage between distributed efforts and more centralized repositories might be explored.


  1. Posted 9/17/2004 at 10:42 pm | Permalink

    almost morning and you may be up with jet lag. been thinking about this post, it’s a complex question. Looked up Jim Gemmell’s link over from microsoft and his power point presentations are not available; neither are Gordon Bell’s.

    let me start with a quote inside The Art of Computer Programming by Donald Knuth:

    “But you can’t look up all those numbers in time,” Drake objected. “We don’t have to, Paul. We merely arrange a list and look for duplications.”

    Perry Mason, Case of the Angry Mourner (1951)

    The first thing that went through my mind was what stephanie perrin spoke about in april during the iwis: perhaps there is a species specific reason why we forget. We may need to learn how to erase certain things for the sake of peace of mind. Perhaps the human psyche cannot and should not touch upon certain types of remembering.

    There is the oral tradition of remembering, which involved song, dance, poetry. Then came the written word: first amongst priests and then for commerce. It spread amongst persons and then there were libraries to store the known knowledge. As literature came into being, it was questioned by a peripatetic (spelling, well perpataou means to walk in greek, it just sounds like that to me and you know what i mean) Socrates. He questioned the written wisdom of the poets and the playwrights; what is it that we know? is it all an argument, a sophistry?

    Then came a Nietzsche with his questions and syphilitic aphorisms.

    Within the haze of history, there is a hazy remembering and forgetting. So how will this create a collective remembering and lead us to finding what will be stored forever and what will somehow be part of the less enduring fate of meaningless clutter?

    While Arun Blake was in NYC, we discussed his “glog” and the entire recording of the meeting we attended; there was lots of stuff before and after that was simply social clutter, and not the meat and potatoes of what we need to hear to get at the concepts.

    I guess history will be increasingly influenced by the overall archiving process whether it is of blogs or glogs.

    I read your blog from the bottom up: then re-read it down, scanning what my reader’s eye found most interesting, then i read it from beginning to end, with the writer’s intent in mind.

    I think dealing with information, persons will begin to take it apart and look at it from multiple cognitive perspectives, just as persons surf and hyperlink, and trackback.

    I disagree that we are developing multiple on line digital personalities, but rather, this gear changing ability is changing the internal narrative that we used to call the super ego into a rerouted frontal lobe that interacts with the limbic system in parallel with the external devices we use to extend our selves. Hence how our minds evolve to fit into this world will be similar to the inventions we build to archive, and data mine and sort. So storage will grow faster than 10 to the 10, or super K as Knuth puts it in his computers and God lecture. Now that’s a big finite number: but we never get to infinity outside of the human brain. I do not believe there is capacity in this generation to achieve a real time zero conception towards defining the methods by which we summarize “allness.”

    My assertion is that as long as there is always something more, or beyond human comprehension, then we move forward in history. once we create an environment when we are not creating more data to be stored and analysed than can possibly be stored, then we contract historically into a dark age. When we undermine the human ability to conceive infinity, we enter the realm of finite certainty, which is really bad for humanity. “Allness” becomes a modern tower of Babylon, with the reaction of nature being a disaster that can happen from these technologies stuck in the finite of machine.

    To be free, we need to be creative, which will make individuals able to create meaningful self histories. Within these self histories we will find the human triumph of beauty despite adversity, which will be the moments archivers and researchers of the future will find most interesting.

  2. Posted 9/17/2004 at 10:45 pm | Permalink

    perry mason got cut out. sorry for the long response…

  3. Posted 9/18/2004 at 12:01 am | Permalink

  4. Anonymous
    Posted 9/18/2004 at 9:23 pm | Permalink

  5. Posted 9/22/2004 at 5:37 pm | Permalink

    Furl is actually here. You were pointing to the .com, which belongs to a squatter.

  6. Posted 9/23/2004 at 6:04 pm | Permalink

    Thanks scamper. I must have been working offline, since I added .com in other places. Now it is pointing to the right place.

  7. Posted 9/27/2004 at 10:39 am | Permalink

    There will come a point where enough of what you do is captured and marked up chronologically that your one view, perhaps the dominant view, of all your artifacts is the weblog. At that point text blogs become a difference without a distinction. We’ll see the Google and Microsoft and Mozilla clients with blogging affordances, tools, and views of data they present through email, calendar, browser, addressbook.

    The redundancy in blogs is like the redundancy built into human speech, into holographic memory, into video encoding. Redundancy through citation assures human understanding. Redundancy through multi-channelling blog posts (the same post on multiple blogs or in html and RSS) and through topics/categories amplifies post-to-post context.

    About scale and storage scarcity. The traditional cite about the information avalanche: 200 years ago America published in one year the equivalent of one Sunday New York Times (I’m getting it wrong but you know what I mean: more published and we read more in a weekend than previous generations read in a lifetime). Now we have the Encyclopedia Galactica at our fingertips and the classic problem of finding stuff.

    I’ll argue that capturing everything, even with all its inefficiencies, is only a temporary (2-3 year) problem that Moore’s Law will fix. The real opportunity and challenge is making sense of the artifacts. Using them to solve real problems. Like deciding if this brand of soy milk is tasty before you buy it. Or find a job. Or a bone marrow donor. Or lucid answers to philosophical questions. Or true love.

12 Trackbacks

  1. […] 17;s a good question… Filed under: General — protesterbee @ 3:33 pm A. Halavais discusses a question about blogs and about the internet that I had never really consi […]

  2. By Alex Halavais » At the British Library on 9/20/2004 at 9:24 pm

    […] here, and some interesting ideas being passed around. I would have thought my blog-centric views on archiving would have failed to find an audience among the library-centric folks here. Ther […]

  3. […] ed space–and much more than Kahle’s one-square meter stack of linux boxes [via Alex]. Retention decisions are made because of practical considerations often having to do more wit […]

  4. […] ed space–and much more than Kahle’s one-square meter stack of linux boxes [via Alex]. Retention decisions are made because of practical considerations often having to do more wit […]

  5. […] ed space–and much more than Kahle’s one-square meter stack of linux boxes [via Alex]. Retention decisions are made because of practical considerations often having to do more wit […]

  6. By Alex Halavais » Looksmart Acquires Furl on 9/24/2004 at 8:38 pm

    […] Looksmart Acquires Furl I’ve recently posted about Furl, a site that allows you to archive bits of the web that you do not want […]

  7. By Sousveillance on 9/18/2004 at 2:37 pm

    Finding the Pre-sousveillance Grammer
    “I guess learning how to manipulate numbers can help deal with huge pools of information; this is all towards learning how to pool glogs and to maintain the integrety of sousveillance teams.”

  8. By sousveillance on 9/18/2004 at 2:45 pm

    you have to store, sort, and data mine to learn how to archive
    ok, trouble with my templates from my blog: i guess i got to learn this perl thing. also one kind of has to goof up a bit as part of the learning process. if one watched Knuth, he goofs up in the funniest ways, lots can be learned, even if one doesn’t…

  9. By Sousveillance on 9/19/2004 at 12:49 pm

    Towards Understanding Continous Archiving, Blogging and Glogging
    hen these tables where switched to punch cards. It became obvious that it was cheaper to “recompute” log x or cos x. The issue of sorting was the first thing addressed by computing as a field. Searching, because internal memories where small, or file…

  10. By soulsoup on 9/20/2004 at 12:33 am

    Archiving the Information Tsunami
    Blogs and Archiving by Alex Halavais ..archiving the web will become much more difficult over the coming decade. This is particularly true because of the shift from a textual web to databases, images, and other mixed media. Tracking changes in a scient…

  11. […] [40] Alex Halavias, Blogs and Archiving. 16 September 2004. […]

  12. […] escribe Alex Havias, “many weblogs are short-lived, and in any event, we can assume that all weblogs are likely […]
