As you may have heard, AOL recently released a data set that includes about 20 million searches from about 500 thousand users. This is a bit of a treasure trove for researchers, as it provides an example of what people search for and how their searches change over time. While users are identified only by a number, as you might guess, a history of searches can provide some interesting and revealing patterns: patterns the users probably did not intend to be revealed publicly. Perhaps not surprisingly, AOL quickly pulled down the data set, but the page is still viewable in the Google cache. Moreover, while you cannot download the file from AOL, it is available as a torrent.
What an extraordinarily tempting piece of research data. And so, the ethical question comes quickly to the fore:
Clearly, no Institutional Review Board would ever allow such a collection. Users had a reasonable expectation that their searches would not be recorded and openly distributed. Moreover, the ability to link searches of a given user makes this a potentially very revealing data set. See, for example, the user looking to kill his wife. I don’t think that the anonymization of user names is enough to make this usable.
And yet, there it is. It’s already out there, and as I said, very tempting. Is it ethical to make use of this already-collected data if your use substantially masks the private matters of these users? Any use I would make of the data would make it extremely unlikely that any private information would be revealed, though the mere existence of the public data set in some ways makes this moot.
The obvious parallel (Godwin’s law notwithstanding) is the controversy over using Nazi experimental data in medical research. But it seems to me that there are some shades of grey here. AOL Search is not a Nazi concentration camp, and it is worth noting that an article based on the data has already appeared in peer-reviewed conference proceedings (pdf). While I think that AOL’s distribution of search data without the clear permission of its users, either to the public or to the government, is pretty clearly unethical, I don’t know that it makes this data poison fruit. Tainted, yes; poison, I don’t think so.
Finally, I wonder what AOL’s move is now. They’ve pulled the plug on the page, but lots of people presumably have the data already and will share it. If AOL now revokes permission to use the data, what does that mean? Do they own the data at this point? Providing and then pulling back data would set a terrible precedent.
Update: TechCrunch has a report on this as well.