# Binary Encoding

On Wednesday, I prematurely aborted the discussion of coding. I’ve gotten a couple of emails on this now, the one most recent from Richard.

To remind you, we had a four-symbol code, consisting of H, A, L, and _. If these symbols were equally distributed, we the number of bits required is 2. We know this because the probability of each character is .25. So the number of bits required is

H = – ( .25 log2 .25 +
.25 log2 .25 +
.25 log2 .25 +
.25 log2 .25 )

or 2.

It’s pretty easy to figure out how to code it into 2 bits:

H – 00
A – 01
L – 10
_ – 11

But, it is not true that (for example) in an English text the probability of E and Z occurring in a string is equal. E appears much more frequently. So, let’s say we observed a pattern that looked like

HAHHHLA_HHAHH_HALH_AHHH_HAHLHAHALH_HALAH

In practice, of course, you would want to have a lot more to go on, but let’s say this was enough to assume it was representative of the language. In this case, of 40 total characters, 20 are H, 10 are A, 5 are L, and 5 are _. Now, our probabilities are different, and so the number of bits required are also different:

H = – ( .5 log2 .5 +
.25 log2 .25 +
.125 log2 .125 +
.125 log2 .125 )

Which ends up as 1.75 bits. How do you fit four symbols into 1.75 bits? Well, the trick is that a message fits in–on average–1.75 bits per character. That means that at least one character needs to fit into a single bit. In our case, the bit we would choose is clear: the H. H appears far more frequently than the other characters, and so it can be compressed more.

H 0
A 01
L 001
_ 0001

The problem here is that if H is 0 it can’t be a part of A. Otherwise, try to decode this:

0010001

It could be decoded in a few ways:

HAHHA
LHA
etc.

And that is no good at all.

H 0
A 10
L 110
_ 111

So to encode HAH_LAH you would end up with

0100111110100

You will see that when you chain these together (and you know where to start), you have only a single way of decoding.

Consider that chain.

We start with a 0 which can only be an H. It can’t be anything else, because H is the only character to start with a 0.

So we have H.

The second bit is a 1. Actually, this could be A, L, or _, so we’ll keep going.

The third bit is a 0, meaning we are up to 10. There is only one character that starts 10, so we know it is an A. We have HA, so far.

The fourth bit is a 0. Again, only an H starts with a 0, so we now have HAH.

The fifth bit is a 1. There are still three possible characters that start with a 1 bit: A, L or _. So we need to look at the next one.

The sixth bit is a 1. There are two possible characters that start with 11: L and _. So we need to go to yet another bit to figure it out.

The seventh bit is 1. So we have 111 and _ is the only possibility.

You get the idea.

Another way to think about the code is as a binary tree:

0 /\ 1
/  \
H  0 /\ 1
/  \
A  0 /\ 1
/  \
L     _

This idea most directly important to schemes for efficiently encoding data; but it turns out that encoding data, as we have been reading, may have importance beyond digital computing machines and communication systems. It may be the stuff that the universe and life is made of.

The next step, which we did not get to, is to look at larger patterns. If we can save a quarter-bit by looking at single letters, what if we go to pairs? Q is followed by U far more often than by any other character. Therefore, the amount of information required to encode the pair is reduced. We can find similar regularities in bunches of 3, 4, 5, and 5000 characters (though we have diminishing returns for encoding).

Remember, Morse’s original idea was to encode whole messages to a particular number. This would be far more efficient than Morse code, and was a coding scheme used by the ancient Greeks (among others). He then thought about a dictionary, with a number for each word in the dictionary. This too would be a more efficient coding, though it would require more work on either side to encode/decode.

Finally, I was ambitious and hoped that we would get to talk a bit about finite state machines. The tree graph above shows how it would be easy to create a simple finite state machine to decode the string of 1s and 0s and form a message. Unfortunately, some portion of the class required time to grok. I do encourage you to work through some of these issues on your own, and for those who are reading from the Com Theory class, we’ll go beyond this a bit to talk about evolutionary models of finite state machines.