AOL Data

As you may have heard, AOL recently released a data set that includes about 20 million searches from about 500 thousand users. This is a bit of a treasure trove for researchers, as it provides an example of what people search for and how their searches change over time. While users are identified only by a number, as you might guess, a history of searches can provide some interesting and revealing patterns: patterns the users probably did not intend to be revealed publicly. Perhaps not surprisingly, AOL quickly pulled down the data set, but the page is still viewable in the Google cache. Moreover, while you cannot download the file from AOL, it is available as a torrent.

What an extraordinarily tempting piece of research data. And so, the ethical question comes quickly to the fore:

Clearly, no Institutional Review Board would ever allow such a collection. Users had a reasonable expectation that their searches would not be recorded and openly distributed. Moreover, the ability to link searches of a given user makes this a potentially very revealing data set. See, for example, the user looking to kill his wife. I don’t think that the anonymization of user names is enough to make this usable.
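The problem with numeric pseudonyms can be made concrete. Below is a minimal sketch using entirely hypothetical rows (the IDs and queries are invented for illustration, not taken from the actual release): because every search by the same person carries the same ID, grouping by that ID reassembles a full per-user history, which is what makes the data revealing despite the "anonymization."

```python
# Hypothetical miniature of a search log: (pseudonymous user ID, query).
# The IDs stand in for screen names, but each user's searches still
# share one ID, so their history can be reassembled.
from collections import defaultdict

log = [
    (1234567, "best pizza near albany"),
    (1234567, "symptoms of insomnia"),
    (1234567, "how to sell a house quickly"),
    (7654321, "used cars"),
]

# Group queries by pseudonymous ID: this single step recovers a
# linked search history for every "anonymized" user.
histories = defaultdict(list)
for user_id, query in log:
    histories[user_id].append(query)

print(len(histories[1234567]))  # → 3
```

Three queries from one person tell you far more than three queries from three strangers, which is why pseudonymization alone does not make such a data set safe.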

And yet, there it is. It’s already out there, and as I said, very tempting. Is it ethical to make use of this already-collected data if your use substantially masks the private matters of these users? Any use I would make of the data would make it extremely unlikely that any private information would be revealed, though the mere existence of the public data set in some ways makes this moot.
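By way of illustration, the kind of use that masks private matters is aggregate-only reporting. This sketch (again on invented rows, not the real log) drops the user IDs entirely and reports only term frequencies, so no per-person history survives in the output:

```python
# Aggregate-only analysis on hypothetical rows: discard user IDs and
# count query frequencies, so the output contains no linked histories.
from collections import Counter

log = [
    (1234567, "weather forecast"),
    (1234567, "weather radar"),
    (7654321, "weather forecast"),
]

# Only the query text is counted; the pseudonymous IDs never
# appear in the result.
term_counts = Counter(query for _user_id, query in log)
print(term_counts.most_common(1))  # [('weather forecast', 2)]
```

Of course, since the raw data set is already public, careful aggregation by one researcher does nothing to protect users from everyone else who has a copy.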

The obvious parallel (Godwin’s law notwithstanding) is the controversy over using Nazi experimental data in medical research. But it seems to me that there are some shades of grey here. AOL Search is not a Nazi concentration camp, and it is worth noting that an article based on the data has already appeared in peer-reviewed conference proceedings (pdf). While I think that the distribution of their search data without the clear permission of its users, either to the public or to the government, is pretty clearly unethical, I don’t know that it makes this data poison fruit. Tainted, yes; poison, I don’t think so.

Finally, I wonder what AOL’s move is now. They’ve pulled the plug on the page, but lots of people presumably have and will share the data. If AOL now revokes permission to use the data, what does that mean? Do they own the data at this point? Providing and then pulling back data would set a terrible precedent.

Update: TechCrunch has a report on this as well.



  1. ty
    Posted 8/12/2006 at 11:14 am

    A site where you can search this data is here:

  2. Posted 8/22/2006 at 4:52 am

    Here’s a *quick* site where you can search the AOL data for yourself:

5 Trackbacks

  1. By Augury on 8/7/2006 at 3:46 pm


    This is too much.
    Stories like this are getting all too frequent — And these aren’t even attacks. These are just companies or governments treating private records of customers’ or citizens’ actions lightly. What happens when act…

  2. By Crooked Timber » » The AOL data mess on 8/7/2006 at 8:29 pm

    […] Another question of interest: Now that these data have been made public what are the chances for approval from a university’s institutional review board for work on this data set? (Alex raises related questions as well.) Would an approval be granted? These users did not consent to their data being used for such purposes. But the data have been made public and theoretically do not contain any identifying information. Even if they do, the researcher could promise that results would only be reported in the aggregate leaving out any potentially identifying information. Hmm… […]

  3. […] Alex Halavais ponders similar questions, and you can really sense his hunger for the data: “While I think that the distribution of their search data without the clear permission of its users, either to the public or to the government, is pretty clearly unethical, I don’t know that it makes this data poison fruit. Tainted, yes; poison, I don’t think so.” […]

  4. […] I do think there are parallels between this and the release of the AOL search data. In both cases, designers failed to predict the potential privacy implications of their systems. It’s worth contrasting these with another rollout over the last few weeks: Flickr’s addition of photo geotagging capabilities. I would be surprised if Flickr actually engaged in more participatory design than Facebook or has, but they made clear when you started geotagging that it would affect your privacy, and gave you a reasonably fine-grained control over who would see what. […]

  5. […] If this becomes common, the difficulty is manifest. Researchers who follow even a small degree of ethical behavior will be left in the dust by “amateurs” (in the kindest sense of the word) and professionals (in the least kind sense) who do not recognize the ethical problems of making use of data that has been taken against the wishes of its owners. We’ve already seen this: I’ve posted about issues of scraping social network sites and the AOL data. But is this the future of online research: a sea of questionable datasets, traded on the black market, and unavailable to researchers who would most benefit from them? […]
