Dave Sifry provides a nice segue in the comments for part 1 by asking just what I mean by reciprocal links. Just to eliminate any appearance that I am gradually revealing a well thought-out plan, allow me to assure you that this is an unedited brain dump. It’s also a distraction while I’m working out, which means that the brain is receiving a suboptimal amount of oxygen.
By a reciprocal link I simply mean a link that is matched by a link back from the target. In terms of the Movable Type system, a trackback automatically creates a reciprocal link. I link to Liz Lawley’s blog and she links to mine, so that counts toward my friend status. The Culture Cat blog links to mine, but until this sentence I did not link to her, and so her link (until just now) counted toward my star status.
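To make the distinction concrete, here is a minimal sketch, assuming links are stored as a set of (source, target) pairs. The function name `classify_links` and the blog identifiers are hypothetical, invented only for illustration.

```python
def classify_links(links, blog):
    """Split one blog's links into reciprocal and inbound-only sets.

    `links` is a set of (source, target) pairs (an assumed representation).
    """
    outbound = {dst for src, dst in links if src == blog and dst != blog}
    inbound = {src for src, dst in links if dst == blog and src != blog}
    return {
        # Blogs I link to that also link back: these count toward "friend" status.
        "friends": outbound & inbound,
        # Blogs that link to me without a link back from me: "star" status.
        "stars": inbound - outbound,
    }


# Hypothetical data mirroring the examples in the text.
links = {
    ("me", "liz"), ("liz", "me"),   # reciprocal
    ("culturecat", "me"),           # inbound only
    ("me", "other_blog"),           # outbound only, counts toward neither
}
result = classify_links(links, "me")
```

Note that once I add a link back to an inbound-only blog, it moves from the "stars" set to the "friends" set, which is exactly what happened to Culture Cat in the paragraph above.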
For this to work, I think you need to ignore non-blogs. If I link to the Washington Post, I don’t really expect a link back. (Incidentally, one might say the same about a link to, say, Instapundit — both examples are akin to broadcasters in the audience they gather.) I won’t bother with a definition of a blog here. Existing data sets — the NITLE census and Technorati, among others — already have collected this data, and either could be used to provisionally define the borders of the blogosphere.
But friendliness as a measure for a single given blog, while of some interest, doesn’t get us too far. Perhaps we could look at the number of links to blogs that are reciprocated as a proportion of the total number of outbound links to blogs. Someone with a low number on such a scale would most likely be a newer user, or for some other reason be relatively disconnected from a neighborhood of other bloggers, while someone who only links to those who link to them is likely to be a highly conservative member of a small (and less dynamic) group.
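The proportion described above could be computed along these lines. This is only a sketch: the set `blogs` stands in for whatever census defines the borders of the blogosphere, and the function name `reciprocity_ratio` is my own invention.

```python
def reciprocity_ratio(links, blog, blogs):
    """Reciprocated outbound blog links divided by all outbound blog links.

    `links` is a set of (source, target) pairs; `blogs` is the set of sites
    that count as blogs (an assumed, externally supplied definition).
    Returns None when the blog has no outbound links to other blogs.
    """
    outbound = {
        dst for src, dst in links
        if src == blog and dst in blogs and dst != blog
    }
    if not outbound:
        return None
    reciprocated = {dst for dst in outbound if (dst, blog) in links}
    return len(reciprocated) / len(outbound)


# Hypothetical data: "a" links to three blogs and one non-blog.
blogs = {"a", "b", "c", "d"}
links = {
    ("a", "b"), ("b", "a"),
    ("a", "c"),
    ("a", "d"), ("d", "a"),
    ("a", "news_site"),      # ignored: not a blog
}
ratio = reciprocity_ratio(links, "a", blogs)  # 2 of 3 outbound blog links are returned
```

A ratio near 0 would correspond to the disconnected newcomer, and a ratio of exactly 1 to the blogger who links only to those who link back.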
But this ranking of an individual blog really isn’t of much interest to me. Communication is all about relationships between things (and between people), so the only reason for highlighting the reciprocal link is that I think it is a more solid base for investigating more complex structures. So much of the literature in network analysis assumes symmetric relationships. In fact, despite the increased interest in non-symmetrical networks, many (most?) of the algorithms assume symmetrical, binary networks. I think those symmetrical links are the ones that make blogging (and the web) more interesting than broadcasting or other forms of macro-publishing.
Let me start conceptually to say that a unit, in the broadest sense, is something that has a significantly larger number of interactions (relationships) among its members than it does with elements not among its members. This very broad description is my departure point, and one that many readers will recognize as the basis of most clustering approaches.
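As a rough illustration of that description, the sketch below counts reciprocal links inside a candidate group against reciprocal links that cross its boundary. The helper names and the sample data are hypothetical, and what counts as "significantly larger" is deliberately left open.

```python
def reciprocal_pairs(links):
    """Return the unordered pairs of sites that link to each other."""
    return {frozenset((a, b)) for a, b in links if a != b and (b, a) in links}


def internal_external(pairs, members):
    """Count reciprocal pairs inside a group versus pairs crossing its border."""
    members = set(members)
    internal = sum(1 for p in pairs if p <= members)
    external = sum(1 for p in pairs if len(p & members) == 1)
    return internal, external


# Hypothetical data: a, b and c all link to one another; c and d, and d and e,
# also exchange links; e links to a without a link back.
mutual = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
links = set(mutual) | {(y, x) for x, y in mutual} | {("e", "a")}

pairs = reciprocal_pairs(links)
counts = internal_external(pairs, {"a", "b", "c"})  # (internal, external)
```

Here the group {a, b, c} has three internal reciprocal links and one that crosses its border, so it looks like a plausible unit. The one-way link from e to a is dropped before counting because only reciprocal links are kept.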
My next step, and this may take some time, is to experiment a bit with clustering blogs based on reciprocal links, and to see what comes of it. I’ve done this before on other data sets, as have others (see, for example, Polanco’s notes: ). I have a feeling that moving beyond the individual blog and into blogging neighborhoods will be essential if we want to come to grips with what is now half a million blogs and may (or may not) explode to ten times that if AOL and others come on board.
I do see an issue with clustering, and that issue will require me to delve a bit into the literature. Hierarchical clustering is based on the premise that clusters do not intersect. That seems to me to be very wrong in the case of the blogging world (and in most other cases that use clustering and PCA to find a lower-dimensional solution). But more complex solutions may be so computationally intensive as to be impossible, especially for such a large set of data.
If we can come up with a two-dimensional map of clusters based on reciprocal linking, it might then be of interest to reintroduce the unidirectional links as an indication of “flow” or perhaps (for purposes of visualization) altitude.