Archive for September, 2004

Search Engines and Jobs, Oh My

Friday, September 24th, 2004

Thursday, October 7, in room 330 of the student union here at UB, Gary Price (one of our graduates) will be giving two talks. Both would be of interest to students in the Media in the Information Age. Hey, they could even blog it! That’s right: real content, for free!

9-11 am: Search Engines, Current Awareness, and Tools of the Trade

1:30 - 3 pm: Career Choices for Today’s Graduates, and How to Market Yourself in the Workplace.

RSVP ASAP to Kirsten Brill (kbrill@buff…) or Ann Kutner (alkutner@buff…) if you plan to attend.

Looksmart Acquires Furl

Friday, September 24th, 2004

I’ve recently posted about Furl, a site that allows you to archive bits of the web that you do not want to disappear. An email went out yesterday to Furl users indicating that it has been acquired by Looksmart. This is a very interesting development, for a few reasons.

While discussing Furl with those who were unfamiliar with it, many asked what the drawbacks were. I noted that most of them had to do with archiving locally, and the issue of whether maintenance and bandwidth costs would kill it when it got popular. Or, more likely, whether it would be sued out of existence by, for example, major media outlets. Major investment has at least shored up the first threat: with enough funding it should be around for a while. But it further complicates the second issue. How will the new company deal with IP questions? It seems to me that it is opening itself up to a lot of liability; not a problem if you are a garage company, a bit more if you actually have some assets. They do have the “don’t do anything bad” section in their ToS, but then, so did Napster.

In either case, Furl will be a project to watch. I’ve been planning on making a WordPress plugin for Furl, but as Matt notes, the better path would be to keep local archives. Those local archives could then provide an easily accessible directory to directory server crawlers, leading to a distributed global internet archive. Maybe if I have some spare time in the next months (or Lazyweb?).

College Radio Online [At IR5.0]

Wednesday, September 22nd, 2004

One of the presentations at Internet Research 5.0 was on the future of college radio in the era of webcasting. The presenter, David Park, is a professor at Lake Forest College and the faculty advisor for the college station there, which recently started broadcasting over the web using a service called Live365. He realized that when they started broadcasting hockey games over the web, the parents of the players, sometimes living very far away, suddenly became listeners of their “local” station. Park says that this raised questions for him about the nature of local radio. Was this an example of the “death of distance”?

Local radio has been considered important for a long time. It allows for people to hear and learn about their local area, hear local artists, and generally helps to connect people within a community. Already, people have said that there is a death of distance when it comes to radio, because there are a limited number of formats across the US. You can turn on the radio in San Diego, and it will generally sound the same as if you were listening in Boston. There is value in local broadcasting, and some of the value has even been codified in law. We don’t want to lose it.

Some have suggested that the ideal is a number of large-scale radio stations (or syndicated networks), along with small stations. Unfortunately, the financial pressures of finding advertisers make it hard to stay in business as a small broadcaster. The nature of college broadcasting, usually subscriber-supported, has managed to keep most of them in business even as much of the industry has consolidated.

It makes sense that college broadcasters would want to go online. Not only can they attract an audience without worrying about getting expensive, more powerful transmitters, it is also a very good way of raising money for the station. Unlike the traditional telethon, users need only click and give, and as a result, they end up with more funds than they would otherwise.

Webcasting allows you to keep much closer track of how big an audience is, and where they are listening from, minute to minute. For directors, the temptation to tune your programming to react to that newly visible audience is very strong.

So what happens when college radio goes from local to global by broadcasting on the Web? Will the new audience shape the content of the news in new ways? Will you have to serve a broad audience who is not interested in local musicians or college sports teams? Park decided to interview directors of college radio stations to see whether this was already happening. He did hour-long interviews with ten directors, and numerous other shorter interviews.

His worries, it turns out, were unfounded for now. Even though webcasting has been around for several years, most listeners remain within the standard broadcast area–they just prefer to listen on the computer. The formats on the radio haven’t changed much during this period, at least from the perspective of those doing the programming.

They did raise two interesting issues though. Many of the out-of-area folks are actually alumni, who leave town but still want to listen to the old station. This could be good, since it raises cohesiveness, but it could also lead to more conservative, unchanging programming. Old listeners may not want to see the station change, while in the broadcast model, the audience “cycles” every four years or so.

The other noted item was that the webcast attracted a lot of listeners overnight, when it was daytime in Japan and elsewhere. One possible future for webcast college radio is that it would broadcast for a more local audience during the day, and a more global audience at night.

This raises the question, I think, of how the change in the delivery system, including the reach and capabilities of the new medium, might affect the content more generally. How will big radio react? What is the future for this new form of broadcasting?

At the British Library

Monday, September 20th, 2004

[The following was almost liveblogged, but the computer ran out of juice just before it was ready to publish. I am at the University of Sussex, at the Association of Internet Researchers annual conference, and it seems like access to the Internet remains bad every year. Hopefully, next year in Chicago will be better. Over the next few days, I'll try to get up a backlog of posts. Also, if you are waiting on grades from me, or other responses, I'll be catching up this weekend when I get back to the states.]

I am now just coming off a short lunch break at the meeting on web archiving in the board room of the British Library. I am not much one for the whole “liveblogging” thing, as I’ve noted before. I like to give things a chance to stick. Unfortunately, since I had to walk off an airplane to this meeting (I could really use a shower!), I’m afraid I might forget some of this. There is a great group of researchers here, and some interesting ideas being passed around. I would have thought my blog-centric views on archiving would have failed to find an audience among the library-centric folks here. There is, in fact, a difference in the way librarians and more blog-centric people think about archiving, but there is more interest in sharing ideas than I might have expected. The major difficulties, if any, seem to revolve around vocabularies (e.g., what constitutes “metadata”). Many of the ideas I presented in my talk had already come out in some form. The biggest thing people seemed interested in was Furl.

Lots of people presented their ideas. I probably should go into more detail here, but I’m not going to. Steve Schneider talked a bit about his experience with coordinating thematic archives, and systems that facilitate this process. Paul Koerbin talked about the Pandora project. Lots of good comments from folks. Pierre Levy once again presented on his ideas towards a universal semantic category system.

After lunch, we are discussing some of the “use cases” on which the national archive will work. Quickly, we get into some pretty wild requests. This is an interesting approach: demonstrating the cases and getting feedback rather than talking about specifications. Frankly, though, the front end is less important to me than revealing the internals.

The main hangup seems to be how to select what should be archived. It’s a little like “what three records would you take with you if you were going to be stranded on an island?” What on the Web really matters enough that the national libraries should be going out and saving it? Steve Schneider, later at the conference, noted that the question was really between prospective needs of scholars, and retrospective needs. The retrospective needs are really important, but extremely hard to predict. My opinion is that the best thing to do is be very receptive to what scholars want archived now and hope that this also counts toward the future.

A comment by Torill Mortensen got us talking about archiving not just the web, but the total experience of the web. At a very basic level, this means holding on to the software that is being used, either by storing machines or emulating them. But she was particularly interested in the question of how to document the kind of information and social ecology that exists around the web. Many of us today have been using broadband exclusively for so long we can’t even imagine dial-up speeds (or the old 110/300 switchable bps modems). We need to be able to understand what the web was, not just what it contained.

Blogs and Archiving

Thursday, September 16th, 2004

This Saturday morning, I will give a short presentation to a working group of the International Internet Preservation Consortium (IIPC) at the British Library on the role of weblogs in archiving (and archiving weblogs). The consortium consists of several national libraries, and our working group has the unenviable task of trying to determine what future historians will want to examine from today’s web. What do we save and why? The effort here is less to answer than to ask questions, and I am posting this outline in the hope readers may have ideas.

A recent conference helped to convince me that archiving the web will become much more difficult over the coming decade. This is particularly true because of the shift from a textual web to databases, images, and other mixed media. Tracking changes in a scientific database, for example, would likely be of importance to future researchers, but presents special challenges. On the flight back, I read the following in an article in the in-flight magazine, from a compiler of an online cuneiform lexicon:

Because it is electronic and will be continuously updated, it will be a living, eternally morphing resource that will never go stale. It will also make the teaching of Sumerian much easier.

Such a “morphing” resource makes the job of the archivist even more important. While regular publications might go through clear editions, and older editions may be retained, here the accumulation of knowledge looks more like a blackboard than a stone tablet. They still turn up such tablets in the desert, representing such things as a sale receipt for a goat purchase, and they remain intact because they are innately difficult to destroy. The content of the web, on the other hand, requires an active effort to copy and store, else it evaporates almost instantly.

The natural response from many of us who are not part of the library tradition is “archive it all.” Even if today’s technologies make that a difficult prospect, proponents suggest, Moore’s Law is even more in evidence in the development of storage technology than it is in processor advancement. Consider how quickly the cost of a megabyte of storage has dropped over the last half-decade. Soon, we will be able to carry the Library of Congress in our wallet, and when that happens, archiving the Internet is sure to follow, at least if you believe the person who has the most experience with this.

The problem is that the content of the internet grows to encompass whatever local storage and bandwidth become available. While the number of hosts may not be growing exponentially, and Google has presumably cached the 3 billion pages it indexes, these sites are increasingly capable of holding more and more data. Archivists are likely to have to play catch-up perpetually. Moreover, even with contemporaneous web content, it is difficult to know what we have and where to find it.

The question remains, what are the major challenges to identifying materials of future interest, and how do we meet those challenges? More especially, what (if any) special challenges and opportunities do weblogs provide to the archival process? Weblogs represent the vanguard of the web, and they hint at the kinds of problems we are likely to encounter in archiving the rest of the web in the future. So understanding the difficulties and prospects found in archiving blogs can help inform the larger process.

The Challenges

Many still consider blogging a fad. It would seem, from the perspective and history of similar technologies, genuinely unlikely that it would become a significant part of how people interact with the world. Of course, the same might have been said of email, a technology that is now so ubiquitous that, with the exception of spam, its effect goes largely unnoticed. I suspect, in fact, that blogs are currently at the peak of public attention, and there will be a watershed period during which the number of blogs and bloggers plummets, followed by steady growth of blog-related technologies in the middle and long term. (That is, it will follow something like the Gartner Group’s Hype Cycle.) Of course, it’s not easy to predict the future of technology, but it seems likely that some form of distributed publishing will continue to take place through this century.

Even if blogging is more than a fad, many will suggest that blog content is ephemeral for a reason: that it is not worth archiving. There are three reasons to question this. First, some portion of weblog content is undeniably worthy of attention by future historians. At present, facts regarding the authenticity of documents “uncovered” in a broadcast report are being heavily discussed on weblogs. This presentation itself, I would hope, would be worthy of future consideration. Second, weblogs are increasingly the public journals of people in all walks of life. The fact that these weblogs are unedited means that much of their content is not of interest to the casual reader. It is much more difficult to know whether a particular item will be of future interest, prompting an effort one news outlet called “saving Shakespeare’s blog”. Finally, social historians have largely had to “make do” with whatever has been saved, but it is wrong to assume that this indicates their preferred source. Indeed, many historians are most interested in the day-to-day life of an ordinary person, and weblogs may provide such a window into ordinary lives.

At present, the enormity of the archival task is daunting. One of the defining characteristics of blogs is that they are updated frequently, in many cases daily. There are mechanisms that would make keeping an archive up to date easier (e.g., update pinging), but they are not employed universally. Identifying new blogs, and recognizing which blogs are no longer updated, can also be challenging. There have been efforts at taking a kind of moving snapshot of most of the blogs on the web (e.g., the NITLE Blogcensus), and several services regularly process the content of updated blogs in order to extract useful information (e.g., Technorati). However, outside of the Internet Archive, none of these is attempting to preserve continuously the full state of these weblogs. There are simply too many, changing too quickly. As more and more people keep blogs, and keep multiple blogs, we will likely find the increase in numbers of blogs (as well as churn) to continue to be a difficult problem.

Audio and video blogs, or textual blogs that contain these elements, represent a very small part of the whole, but these are likely to become more common as more people make use of camera phones and other mobile devices. At present, multimedia itself is often a size and bandwidth bottleneck on the web. The most widely viewed multimedia files have begun to be archived thanks to the Internet Archive’s freecache service, but less widely viewed multimedia is generally not archived. As we move beyond even Bush’s memex, and researchers begin to keep a continuous lifelog, we will be faced with the question of how to make sure local recordings are somehow distributed and archived in a useful way.

There is a significant amount of redundancy found in entries on weblogs. Despite being highly personal, many sites comment on news reports and add only personal opinion or analysis. An overly extreme analogy would ask whether keeping individual votes in an election would be worthwhile, or if the result is all that is important. Postings that merely reiterate or amplify information found elsewhere may be seen as less important to archive. Unfortunately, even information that is unique is often highly intertextual. Because so much of what is written on weblogs concerns current events, and because it is part of a continuing narrative, there are both temporal links to earlier changes, and hyperlinks to a broader web. Paradoxically, while many blog entries may seem redundant, most also rely heavily on context to be properly interpreted.

Finally, despite the fact that most blogs are produced by a fairly limited set of content management systems, there is only partial consistency in the structure of information in a weblog. There have been some early efforts to mine that structure in order to provide some kind of semantic tagging for the content, but the flexibility afforded to the users of most blogging systems makes it difficult to be able to consistently extract such information.

The Opportunities

The success of weblogs over the last several years provides us with some suggestions for how to exploit the structure and nature of blogging to help inform the process of archiving. Weblogs present some structural conditions that can potentially make the archival process more effective, but perhaps more importantly, it may be possible to leverage the collective activities of bloggers to aid in the archival process.

There has been a great deal of discussion over whether the semantic web is more than just hyperbole, and especially whether it is as inevitable as early promoters suggested. The blogging community has become a kind of tinkerer’s testbed of semantic data, in which necessity has led to rough design and collective refinement. Nowhere is this as obvious as in the case of the development of RSS. For the search engine developer, and for similar reasons for the archivist, RSS represents a prime resource to be exploited. Changes to a site are presented in an easily accessible format, and data regarding their source and authorship are presented in a form easily parsed and archived. Other forms of metadata, from Dublin Core meta tags to Creative Commons licenses in RDF format, are also more frequently found on weblogs than on other websites, at least for the time being. Public RSS aggregators like Bloglines already archive the output of a large part of the web, and provide an additional approximation of the readership of various weblogs.
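To make the point about RSS concrete, here is a minimal sketch of how an archiver might pull change, source, and authorship data out of a feed. The sample feed, the example.org addresses, and the function name extract_items are all made up for illustration, and the sketch assumes a well-formed RSS 2.0 document; real feeds vary a good deal and would need more forgiving handling.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.org/</link>
    <item>
      <title>First post</title>
      <link>http://example.org/2004/09/first-post</link>
      <pubDate>Thu, 16 Sep 2004 09:00:00 GMT</pubDate>
      <author>alice@example.org (Alice)</author>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2004/09/second-post</link>
      <pubDate>Fri, 17 Sep 2004 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def extract_items(feed_xml):
    """Return one record per <item>, tagged with the feed it came from."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    source = channel.findtext("title")
    records = []
    for item in channel.findall("item"):
        records.append({
            "source": source,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            # <author> is optional in RSS 2.0, so this may be None.
            "author": item.findtext("author"),
        })
    return records


if __name__ == "__main__":
    for record in extract_items(SAMPLE_FEED):
        print(record["published"], record["link"])
```

Each record carries enough to decide what to fetch and how to label it in an archive; the missing author on the second item is the kind of gap an archivist would have to expect.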

In almost all cases, weblogs produce their own self archive. That archive is particularly tenuous. The author or authors of a particular weblog are normally able to change the archived content without a clear indication that such a change has taken place. More importantly, however, the archive only lasts as long as the initial author keeps the weblog. Many weblogs are short-lived, and in any event, we can assume that all weblogs are likely to be kept in operation for a finite amount of time. These local archives need to be duplicated elsewhere. At present there is nothing as simple as RSS that allows for these archives to be duplicated. However, similarity in archive addressing patterns allows these archives to be extracted in most cases. This means that an archiving effort need not keep up with the quick pace of changes on many weblogs, and instead can rely on archives for these sites. If a standardized way of accessing these archives were evolved within the blogging world, it is possible that it would be adopted by other frequently-updated websites.
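As a rough illustration of what exploiting archive addressing patterns could look like, the sketch below generates candidate monthly archive addresses for a weblog. The year/month pattern, the example.org address, and the function name monthly_archive_urls are assumptions chosen for the example; each blogging system lays out its archives differently, so the pattern would have to be supplied per system.

```python
from datetime import date


def monthly_archive_urls(base_url, first, last,
                         pattern="{base}/{year:04d}/{month:02d}/"):
    """List candidate monthly archive URLs from `first` to `last`, inclusive.

    `pattern` is a guess at a common year/month layout, not a standard.
    """
    base = base_url.rstrip("/")
    urls = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        urls.append(pattern.format(base=base, year=year, month=month))
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return urls


if __name__ == "__main__":
    for url in monthly_archive_urls("http://example.org/blog/",
                                    date(2004, 11, 5), date(2005, 2, 1)):
        print(url)
```

An archiver could walk such a list at its own pace rather than chasing every update, which is the advantage the self archive offers.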

The fate of centralized libraries in the distant and recent past should guide us toward ways of distributing archives. Automatically accessible self-archiving is one way to do this, but there are at least two others. A project analogous to (or based on) the Freenet project might be another. A distributed archive that made use of some part of a large group of users’ hard drives, along with error correction and encryption, would provide an extraordinarily resilient archive of the web. Users could select data that should be archived (perhaps automatically by mining the local browser cache), and contribute resources toward the storage of the archive. In many ways such a peer-to-peer archive is a very promising direction for archiving over the long term.
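One way to picture the peer-to-peer idea is the sketch below, which splits a page into chunks, names each chunk by its hash, and picks several peers to hold a copy. This is not how Freenet works, and it substitutes plain replication for the error correction and encryption mentioned above; the peer names, chunk size, and function names are all hypothetical.

```python
import hashlib


def chunk_id(data):
    """Name a chunk by the SHA-256 hash of its bytes (content addressing)."""
    return hashlib.sha256(data).hexdigest()


def split_into_chunks(data, size):
    """Cut the data into pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def choose_peers(cid, peers, copies=3):
    """Pick `copies` peers for a chunk by ranking hash(peer + chunk id).

    Every participant computes the same ranking, so no central index
    is needed to find out where a chunk should live.
    """
    ranked = sorted(
        peers,
        key=lambda peer: hashlib.sha256(
            (peer + ":" + cid).encode("utf-8")).hexdigest(),
    )
    return ranked[:copies]


def place(data, peers, size=1024, copies=3):
    """Map each chunk id to the peers that should store a copy of it."""
    placement = {}
    for chunk in split_into_chunks(data, size):
        cid = chunk_id(chunk)
        placement[cid] = choose_peers(cid, peers, copies)
    return placement


if __name__ == "__main__":
    peers = ["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"]
    page = b"<html>an archived page</html>" * 100
    for cid, holders in place(page, peers).items():
        print(cid[:12], holders)
```

Because chunks are named by their content, identical chunks collapse into one entry, and losing any single peer still leaves other copies in place.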

In the near term, users may be exploited to identify information to be stored. Presumably, this resource identification feature is one of the functions of the Alexa toolbar. Bloggers routinely annotate the wider web, identifying resources that are of interest and indicating why they might be valuable. This ongoing review of material provides a way for archivists to locate material that is of particular popular interest at any given time. Indeed, services like Blogdex or Daypop would provide such a list of “sites of interest.” Alternatively, del.icio.us provides an ongoing collection of such links, and Furl takes this a step further, archiving pages on a user’s request. Another alternative is an extended form of self-archiving, which locally caches copies of pages linked from a weblog. Although there has been interest expressed in such a system, I am not aware of one being used at present.
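The extended self-archiving idea could start with something as small as the sketch below, which reads one post, collects the pages it links to, and assigns each a local cache file name. It deliberately stops short of fetching anything; the class and function names, the example addresses, and the hash-based file naming are assumptions made for illustration, not a description of any existing plugin.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags, in order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the address of the post itself.
        absolute = urljoin(self.base_url, href)
        if (urlparse(absolute).scheme in ("http", "https")
                and absolute not in self.links):
            self.links.append(absolute)


def links_to_cache(post_html, post_url):
    """Map each linked page to the file name its cached copy would use."""
    collector = LinkCollector(post_url)
    collector.feed(post_html)
    return {
        url: hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        for url in collector.links
    }


if __name__ == "__main__":
    post = ('<p>See <a href="http://example.com/story">this story</a> '
            'and the <a href="/about">about page</a>.</p>')
    for url, name in links_to_cache(
            post, "http://example.org/2004/09/post").items():
        print(name, url)
```

A real system would still have to fetch and store the pages, honor each site's wishes about copying, and decide how often to refresh the copies.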

Given that such collective approaches to archiving are already emerging organically from the blogging community at large, and the advantages of distributed archives over localized copies, it seems to me that such approaches should be investigated and encouraged, and that a marriage between distributed efforts and more centralized repositories might be explored.

Remaindered Links

Thursday, September 16th, 2004

Here are some links I had hoped to post about, but didn’t get around to before the trip:

* Horrified Observers of Pedestrian Entertainment - Because someone needs to protest against Paris Hilton.
* Stripe Snoop - Software tools for reading and writing stripe data. What’s in your wallet?
* Human Development and Capability Association - Taking the capability approach to development.
* Infectious Awareables - Great anthrax spore micrograph tie. (Not so good for professors who might be worried about being arrested by the FBI.)
* Cornell redesign blog - What is the process of a large web-site development? Follow the path through Cornell’s redevelopment.
* Sun Blogs - Sun actively encourages employees to blog (along with many other tech companies).
* Eyetrack III - Where do people look first on newspaper web pages?
* Press Think - A blog dedicated to extracting “serious journalism” from “the Media.”
* Spellcheck your blog - Tools for spellchecking your blog entries. But none of these will catch homonyms, unfortunately.
* Like Kryptonite to Kryptonites - A Bic pen is enough to pick a Kryptonite lock?
* Wifi Woods - Folks are worried about wireless at the conference in Sussex this weekend. Luckily, all it will take to get connected is a short hike :).

Return to sender?

Thursday, September 16th, 2004



I know how I read this 404 when I picked it up. I wonder what it really means.