Rolan the Datapimp

This is, in fact, a reply to a comment made by Rolan the Datapimp (of Orkut mapping fame) in the “below thread”: Go read that first, if you haven’t already.

First, I want to thank both Alan and Rolan for great comments. They will give us a nice set of things to discuss in my seminar tomorrow when we talk about ethical issues surrounding scraping.

Rolan: I tend to agree with you about people basically providing information in a quasi-public manner. I suppose your argument that it was kept within Orkut would be stronger if you provided access only to bona fide members, but even then, I think that particular issue is less important. Dissemination, say, in an academic journal could be relatively limited (to several thousand, or hundred), but always with the potential of wide public consumption.

Let’s take your example of the “sex convention taking place at the Javitts center.” (Why don’t I ever get invited to the interesting conferences!?) Your implication here, I think, is that it is a large public gathering, and therefore different rules apply. It is not, for example, a workshop for breast cancer survivors for 80 people. In the latter case, it seems that collecting and disseminating the data would be particularly invasive. But what if you were a journalist, and went to the sex conference, where there were 25 or 30 thousand people all wearing name tags. Couldn’t you write them down?

Or, to keep with the metaphor, couldn’t you set up a camera at the front door and automatically photograph each person that came in, so you had a collected database of each of the attendees? The answer to the latter question is “probably not.” If you did this, they would send a couple of guys over to remove you, and remind you that photographs are not allowed at the conference.

And that’s where the problem exists. I am not alone in thinking that the Terms of Service are crap. They are asking you to provide personal data, but requiring that you not scrape that data from the site. This is the equivalent of the “no photography” rule. And, although you (Rolan) never note how you got the data, it is presumed that it was not a gift from the Google people.

As I originally noted, I think it would be appropriate for Google/Orkut to provide this data to researchers, in the same way that a convention might allow a photographer to take pictures as long as it wasn’t for nefarious purposes. What sort of purposes? The obvious answer there is highly directed spam–in both the metaphorical case and the actual case of Orkut.

So the problem I have (and it isn’t really with Rolan; as Alan says, I am jealous of the freedom he has) is whether the ToS should apply to me if my purposes are research. The problem has an ethical dimension (should I be doing this? is the potential harm in doing so greater than the potential good? or, in a non-utilitarian form, is there an absolute prohibition because people have a reasonable expectation that this will not occur?), a practical issue (will this pass muster with IRB? is this a violation of accepted research standards such as the ethical standards promulgated by the Association of Internet Researchers [pdf] and the research community at large?), and a legal question (are Orkut’s terms of service with regard to collecting data enforceable?).

This problem is exacerbated by the fact that the instrument of my collection, rather than the collection itself, is the target of the prohibition. Again, I could pay a few students minimum wage to record this data manually. Would that be acceptable? What if I had them sit in a lab and hit each page, saving a local copy? This would appear exactly the same as a scraper to Orkut, and would yield exactly the same results, but would presumably be acceptable.

It turns out that this is roughly how I will be using the data for an upcoming research project. It will have a level of opt-in-ness from the participants, and will abide by the ToS, and will be reviewed by the IRB. I am still toying, however, with the idea that I may be able to use data from Orkut directly. Alan says it’s a clear “no,” but I think it’s a fuzzy no. One way to clear this up, frankly, would be to survey Orkut users to inquire as to their “expectation of privacy.” If it could be shown that members of the community expect their data to be used in such ways, then I think we could use the data more freely. I would expect that they approve of Rolan’s very cool work, and would not approve of a spammer using it. If we can show that this is a broad expectation, I think our use of the data would be acceptable to an IRB. Not sure how they would handle the ToS, though.



  1. Posted 2/23/2004 at 3:03 am | Permalink

    I’m going to throw you a curveball by revealing some information.

    The Orkut user database was offered to me by an associate (who came into its possession through another individual). At the time I received it, we did not know the source of the information (directly from an individual within Google, or via spidering). Later, we discovered that the individual did not work for Google, and so the data was assumed to have been spidered.

    The Orkut TOS states that it is against policy to use “any robot, spider, site search/retrieval application, or other device to retrieve or index any portion or the service.” Fortunately, possession of the data is not prohibited by the TOS :) This is definitely walking within a grey area. If the case were taken to a US court, a “receipt of stolen property” argument could be used. Hrm.. suppose I had posted the whole user database onto this blog? Would everyone who accessed the page technically be considered guilty of receiving the stolen information, unknowingly or not? :P

    I am very much an advocate of privacy. If posting Orkut user information would permit spammers to harvest emails, I would never have publicized the application. The potential harm in our situation has been minimized because the only way one can gain access to emailing an Orkut user is by logging into the Orkut website (which requires that they become a member).

    Since we are probably in the clear from a legal perspective, the question of legitimately using the data reverts back to an ethical one.

  2. Posted 2/23/2004 at 3:24 am | Permalink

    If Orkut were more transparent and actually dealt with the privacy issues instead of hiding behind an extreme TOS, this wouldn’t be a problem. What they should actually do is to publish the data as FOAF but let the users decide what and how much of their personal profile was visible. Similarly, they *should* have a publicly visible HTML profile page. Again letting the users decide what is shown.

    It’s actually our data. We typed it in. It’s about us. We want it to be visible within limits. If we didn’t want it available we wouldn’t post to a public forum, albeit one that hides behind a login.

    “They stole our data. Now we’re stealing it back. One friend at a time.”

  3. Posted 2/23/2004 at 2:54 pm | Permalink

    This whole question is a lot like discussions that have taken place at LambdaMOO and no doubt many other online communities over the last decade or two. At LambdaMOO there was general uproar among users when researchers tried to do things like this. Even when researchers were mapping social networks between users based only on their MOO nicknames and not using their physical information, users rebelled. If you go poking around the archives of the in-MOO mailing lists (there are some with the topic research) you’ll find heaps of stuff about this. I think there were some ballots about it too.

    Alex, as a user of Orkut I certainly do NOT expect my information to be used for research without my explicit consent.

    Julian: “It’s actually our data. We typed it in. It’s about us. We want it to be visible within limits. If we didn’t want it available we wouldn’t post to a public forum, albeit one that hides behind a login.”
    –> for me, within LIMITS is pretty important here. I signed up for my data to be within a closed website, NOT spiderable, not open to everyone everywhere.

    Rolan: I don’t think the fact that the links to your map were from other bloggers, not you, makes a lot of difference. If it’s on the web, it’s public. One of the issues of contention on LambdaMOO was a researcher who had people’s permission to use their data (as I remember) but who had posted a draft version of her paper on the web, unlinked, without assigning pseudonyms to the subjects. They found it and were furious. As I’d be pissed off if my Orkut data was used for other purposes without my permission. Luckily I don’t live in the States so it wasn’t. This time.

  4. Posted 2/23/2004 at 3:11 pm | Permalink

    You might want to have a look at the Association of Internet Researchers’ report on ethics. And perhaps the following, taken from that report, is why my (Norwegian) reaction seems to be far stricter than the other comments here:


    This article has more discussion of this kind of issue – which crops up not only in MOO research but in research on mailing lists – and references:

    Lots of other research on this, too, of course.

  5. Posted 2/23/2004 at 5:40 pm | Permalink

    I guess Trackback is not working so I manually trackback my entry about this entry:
    Scraping a site is ethical?

  6. Posted 2/23/2004 at 5:47 pm | Permalink

    Jill’s comments were insightful and leave me a lot to read :) I do have a question though…

    At what point is the users’ privacy violated?
    – During the unauthorized collection of user data?
    – When one comes into possession of the data?
    – The processing/analysis of the data?
    – When publishing the results of processed data?

    Also, you state that “If it’s on the web, it’s public.” Technically Orkut is on the web, and every member’s data is accessible by anyone with an Orkut login.

    Oh.. one last thought.. Would you feel that the users’ privacy was violated if the maps only displayed population density and did not feature an option to list names within the viewing area?

  7. Posted 2/24/2004 at 1:12 pm | Permalink

    Ah, OK. For me the problem is the (open) publication of my name in relation to data about me that I gave out in a different context than that in which it’s been published.

    I voluntarily gave out information in Orkut, but yes, although that is on the web (OK, I wasn’t specific enough there) it’s password protected and access is limited to others who have also voluntarily shared information about themselves. There’s a mutuality there, and I do experience a site like Orkut as a more closed form of publication than putting something freely on the web. I think this happens in email lists and places like MOOs, too: although anyone can join most of these communities, what is written there is meant FOR that community, not for the general public.

    I wouldn’t have an issue with a map showing density but no names. For me – and European privacy legislation – it’s collecting and publishing identifiable data about individuals that’s illegal. Actually what you did, as far as I understand the law, would have been clearly illegal if you’d done it with European residents or citizens. I don’t know whether it’s legal under US law. As I understand it US privacy legislation is far less strict than the European. I’m not a lawyer and find these documents hard to scan and understand, but the actual legislative documents are here:

  8. Posted 2/24/2004 at 10:57 pm | Permalink

    In Europe, do the publishers of telephone directories require consent from every named individual or organization listed in their books?

2 Trackbacks

  1. By Paolo Massa Blog on 2/23/2004 at 4:42 pm

    Scraping a site is ethical?
    At Rolan the Datapimp, Alex asks himself if collecting data automatically with a script is legal or ethical. Actually I scraped site for data and I don’t care too much…

  2. By Many-to-Many on 3/1/2004 at 8:15 pm

    Is it OK to publish Orkut-collected datasets?
    Alex Halavais’ blog is home to an interesting discussion of the privacy / information property issues around the Orkut geomap we wrote about two weeks ago: part one, part two. It’s worth noting that Rolan (the datapimp / Orkut…
