[the making of, pt. 4] Basic descriptions of the sample

This is the fourth in a series of posts about a paper I am writing, breaking down the process old-school. It started here. So, in part 3, I talked about how I got the sample of the users (and waved my hands a bit about the sample of the comments). Now, I want to tell my audience (and know myself) the basic structure of the sample.

Counting it up

I can say, for example, how many accounts I collected (30,000), what the oldest of these accounts is (December of 2004), and what the newest accounts are (cut off at May of 2008). I also want to say something about how many of these users post comments, and how many comments I have.

The latter is pretty easy: I dump my comma-delimited file of comments into a plain text editor. I use Notepad2 for this sort of thing, because it has line numbering (making my job easier), and doesn’t–like the original Notepad–crash a Windows system when you try to open very large files. In total, 197,658 comments.
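If you would rather not rely on the editor's line numbering, a few lines of Python do the same count. This is just a sketch; the filename here is a placeholder for wherever your comma-delimited dump lives.

```python
# Count the comments by counting lines in the dump file.
# "comments.csv" is a hypothetical name; substitute your own.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)
```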


So, on average, that's a lot of comments: about 6.6 per user. But we know it's unlikely that many people post the "average" number of comments, or even that the counts are distributed normally around that average. Far more likely, a large number of people post never or infrequently, and a handful of enthusiasts post every two minutes or so. What we need to do is count up the comments by user.

So we turn to Python again, and write up a quick script that goes through the 197K comments and counts up how many each user made. In practice, the program doesn't find all 30,000 users in our sample, because 23,532 of them have not posted a comment. The result is a comma-delimited file with the user name and number of comments. Now we can construct a histogram.
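A minimal sketch of that counting script, assuming the comment file is comma-delimited with the user name in the first column (the filenames are hypothetical):

```python
import csv
from collections import Counter

def count_comments_per_user(path):
    # Tally one count per row, keyed on the first column (the user name).
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if row:  # skip any blank lines
                counts[row[0]] += 1
    return counts

def write_counts(counts, out_path):
    # Write one "user,count" line per commenter, busiest first.
    with open(out_path, "w", encoding="utf-8") as f:
        for user, n in counts.most_common():
            f.write(f"{user},{n}\n")
```

Users with zero comments simply never appear in the input, which is why the output covers only 6,468 of the 30,000 accounts.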


I am a big fan of Excel, and we could use it to create the histogram, but I always seem to spend about 15 minutes figuring out how to do histograms in Excel, relearning it each time. The obvious choice is SPSS, but for a change of pace, I'm going to use a free piece of statistics software called R.

The reason is simple enough. A quick run through a regular histogram shows that this is a heavily skewed, Pareto-like (power-law) distribution. When I plot it as a regular histogram, it comes out as two lines along the axes, and a tiny curve at the origin. One person actually made 6,598 comments, and I had to check the site to make sure there hadn't been an error. Another posted over 4,000 comments.

So, what we need is a log-log histogram. Although I'm sure there is a function that will do this for me neatly (and I have to admit ignorance when it comes to doing this in SPSS, but I suspect it's just a matter of checking a box), I'm once again going to turn to Python to write a script that comes up with frequencies (i.e., how many people posted once, twice, ... n times). I could "bin" these frequencies and come up with something that looks like a regular histogram, but since folks are not as used to seeing log-log bar charts, I decided to do it without the bins. The resulting file is just a number on each line, starting with the number of people with one comment; the next line is the number of users with two comments, and so on. I drop this file into Notepad2 to take a look, and (CTRL-C) copy all the data.
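A sketch of that frequency script, assuming we already have the per-user counts from the previous step. It returns the list that gets written one number per line, starting with the users who posted exactly one comment:

```python
from collections import Counter

def comment_frequencies(user_counts):
    # user_counts maps user name -> number of comments.
    # freq maps number of comments -> number of users with that count.
    freq = Counter(user_counts.values())
    top = max(freq) if freq else 0
    # One entry per comment count from 1 up to the maximum,
    # filling in zeros for counts nobody hit.
    return [freq.get(k, 0) for k in range(1, top + 1)]
```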

I open up R, and first execute this command:

x <- type.convert(readClipboard())

This loads all of the data I just copied into a "vector" called x. If you are unfamiliar with the format of R commands, note that the <- is an assignment symbol: it says put the stuff on the right into the box on the left. The readClipboard function–shockingly–reads whatever is on the Windows clipboard. The type.convert function converts strings into integers, since the clipboard just assumes whatever you are copying is a string (or character) rather than a number. Now we have all this stuff in the vector x.

Next, I issue the following command:

plot(x, log="xy", xlab="log(number of comments)", ylab="log(number of users)")

which produces the plot shown to the right. It should be pretty clear what each of the options does: together they create a log-log plot of the vector x, with a label on each axis.

Next: Hypothesis testing!

Now we have some basic descriptions of the data, enough to give the reader a feel for what we are working with. Time to rearrange the data a few more times and take measurements that will help us answer questions about the relationship of feedback scores to posting behavior, in part 5.
