This is, in fact, a reply to a comment made by Rolan the Datapimp (of Orkut mapping fame) in the “below thread”:http://alex.halavais.net/news/archives/000789.html. Go read that first, if you haven’t already.
First, I want to thank both Alan and Rolan for great comments. They will give us a nice set of things to discuss in my seminar tomorrow when we talk about ethical issues surrounding scraping.
Rolan: I tend to agree with you about people basically providing information in a quasi-public manner. I suppose your argument that it was kept within Orkut would be stronger if you provided access only to bona fide members, but even then, I think that particular issue is less important. Dissemination in, say, an academic journal could be relatively limited (to several thousand, or several hundred), but always with the potential of wide public consumption.
Let’s take your example of the “sex convention taking place at the Javitts center.” (Why don’t I ever get invited to the interesting conferences!?) Your implication here, I think, is that it is a large public gathering, and therefore different rules apply. It is not, for example, a workshop for breast cancer survivors for 80 people. In the latter case, it seems that collecting and disseminating the data would be particularly invasive. But what if you were a journalist, and went to the sex conference, where there were 25 or 30 thousand people all wearing name tags? Couldn’t you write them down?
Or, to keep with the metaphor, couldn’t you set up a camera at the front door and automatically photograph each person who came in, so you had a collected database of all the attendees? The answer to the latter question is “probably not.” If you did this, they would send a couple of guys over to remove you, and remind you that photographs are not allowed at the conference.
And that’s where the problem exists. I am not alone in thinking that the Terms of Service are crap. They are asking you to provide personal data, but requiring that you not scrape that data from the site. This is the equivalent of the “no photography” rule. And, although you (Rolan) never note how you got the data, it is presumed that it was not a gift from the Google people.
As I originally noted, I think it would be appropriate for Google/Orkut to provide this data to researchers, in the same way that a convention might allow a photographer to take pictures as long as it wasn’t for nefarious purposes. What sort of purposes? The obvious answer there is highly directed spam, in both the metaphorical case and the actual case of Orkut.
So the problem I have (and it isn’t really with Rolan: as Alan says, I am jealous of the freedom he has) is whether the ToS should apply to me, if my purposes are research. The problem has an ethical dimension (should I be doing this? is the potential harm in doing so greater than the potential good? or, in a non-utilitarian form, is there an absolute prohibition because people have a reasonable expectation that this will not occur?), a practical issue (will this pass muster with IRB? is this a violation of accepted research standards such as the ethical standards promulgated by the Association of Internet Researchers [pdf] and the research community at large?), and a legal question (are Orkut’s terms of service with regard to collecting data enforceable?).
This problem is exacerbated by the fact that the instrument of my collection, rather than the collection itself, is the target of the prohibition. Again, I could pay a few students minimum wage to record this data manually. Would that be acceptable? What if I had them sit in a lab and hit each page, saving a local copy? This would appear exactly the same as a scraper to Orkut, and would yield exactly the same results, but would presumably be acceptable.
It turns out that this is roughly how I will be using the data for an upcoming research project. It will have a level of opt-in-ness from the participants, and will abide by the ToS, and will be reviewed by the IRB. I am still toying, however, with the idea that I may be able to use data from Orkut directly. Alan says it’s a clear “no,” but I think it’s a fuzzy no. One way to clear this up, frankly, would be to survey Orkut users to inquire as to their “expectation of privacy.” If it could be shown that members of the community expect their data to be used in such ways, then I think we could use the data more freely. I would expect that they approve of Rolan’s very cool work, and would not approve of a spammer using it. If we can show that this is a broad expectation, I think our use of the data would be acceptable to an IRB. Not sure how they would handle the ToS, though.