Ah, the joys of the modern digital age. Now as far as ages go, we’ve clearly entered into the Cambrian explosion of digital content. You see, we have plethoras of data out there on the internet– books, music, videos, even your favorite TV shows are all available at the touch of a button or a finger, on any device you want, anywhere in the world. Now think about this for a minute. I mean, we’re talking about yottabytes upon exabytes upon googols of information, all flowing through the internet, everywhere in the world, every second of the day.

Now since the late ’70s, information theorists and computer scientists have been utilizing compression algorithms to effectively reduce the size of this content, to keep the internet from coming to a standstill. And this is why today we start hearing more and more about the connection between conversion rates and data sizes for applications and websites. So for example, a user may browse away from a website if it hasn’t loaded in three seconds, or may not download an application because it would incur extra charges from their carrier. Now savvy developers need to take control of this fact. And that means bucking the trend of just using the standard compression algorithms out there. Instead, you need to start digging in deep, understanding your data, and utilizing compression in new and interesting ways to make your users happy. But fear not, young developer. I’m here to help with that process. My name is Colt McAnlis, and this is Compressor Head.

In 1836, an American inventor named Samuel F. B. Morse concocted a way to send electrical pulses through a wire. Now, at the end of that wire was attached an electromagnet. He, along with other physicists, realized that if they could separate out the pulses of electricity on the wire to control the electromagnet, they could ring a bell every time a pulse came through. And thus was born the electrical telegraph system. You see, the entire intent of this operation was to find a way to send natural language over great distances, using a pretty robust medium. Now of course, they needed to devise a way to take an English character– a, b, or c– and actually turn it into electrical pulses– dot dash dot dot dash. And thus was born Morse code.

The way they assigned these codes was actually based on the inverse frequency of a symbol occurring in the English language. So for example, e, the most commonly occurring letter in English, was assigned the smallest code. The reason for this was– you have to remember– they had some guy sitting there banging on the electromagnet, sending pulses through a wire, and they wanted to minimize the amount of work that needed to be done to communicate a message. And of course, all of the other symbols are assigned based upon their probability, getting more complex the less probable they are to occur.

Now in modern computing, we actually use a different sort of form, and that’s called binary code. Now binary code is not a variable-length code. Instead, what we say is that you’re going to use a set number of bits, and then we’re able to assign a symbol to a given number of bits using that static amount. So for example, ASCII assigns eight bits to a single character. So when we get a stream of these bits on the wire, we know we can read in eight bits at a time and determine what symbol we’re actually looking at.
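To make that concrete, here is a minimal sketch of fixed-length decoding; the bit string and the eight-bit width are just illustrative assumptions:

```python
# Fixed-length (ASCII-style) decoding: every symbol is exactly 8 bits,
# so we can slice the stream at fixed offsets without any ambiguity.
def decode_fixed(bits, width=8):
    assert len(bits) % width == 0, "stream must be a multiple of the symbol width"
    return "".join(
        chr(int(bits[i:i + width], 2))      # each 8-bit chunk is one character code
        for i in range(0, len(bits), width)
    )

# "abc" encoded as three 8-bit ASCII values.
print(decode_fixed("011000010110001001100011"))  # -> abc
```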

Now however, if we decided to move in the other direction and assign codes by frequency, much like Morse did, you see that we can actually get to something a little bit better. Now what happens here is, if we take into account the probability or frequency of occurrence of a character and use that to assign a variable-length bit pattern to it, we can actually start getting some interesting things. So for example, c, occurring in a given symbol stream 80% of the time, gets assigned the shortest code word, a single 0. Now what this means is, instead of assigning eight bits per character for our entire stream, we can assign a variable number of bits per character and, in this example, actually end up getting 81% savings. This is the definition of variable-length codes, and it’s what we’re going to talk about more of today.
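As a rough sketch of where that kind of savings comes from (the symbol table and probabilities below are my assumptions for illustration, not the episode’s exact table), compare the variable-length cost of a stream against the flat eight-bit cost:

```python
# Average cost of a variable-length code vs. a flat 8-bit encoding.
# The probabilities and code words here are illustrative assumptions.
table = {"c": ("0", 0.80), "a": ("10", 0.10), "b": ("110", 0.05), "d": ("111", 0.05)}

avg_bits = sum(len(code) * p for code, p in table.values())
print(avg_bits)          # 1.3 bits per symbol on average
print(1 - avg_bits / 8)  # ~0.84, i.e. roughly 84% smaller than 8 bits per symbol
```

Not the exact 81% figure, since this table is made up, but it is the same effect: short codes on frequent symbols shrink the whole stream.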

Now it would seem that you can’t really have a discussion about compression without actually going back and talking about information theory itself. And when you’re talking about information theory, you can’t talk about it without introducing this guy, Claude Shannon. Now Claude, who besides being awesome at parties– really good with keg stands– pretty much invented what we call modern information theory. Now what Claude did, effectively, was find that there was a correlation between the information content of a symbol and the probability of that symbol occurring in the stream. He was able to measure this correlation with the logarithm function and actually get a result in a number of bits. So effectively, he could find out what the information content was in terms of binary data.

Now this itself was a pretty bold move that actually defined how information theory works. But for our needs, what we’re more interested in is his definition of entropy. See, when you take the information content of every symbol in the stream, weight each one by its probability, and sum it all together, you end up with the equation for entropy. Now besides being a little gnarly looking, what we really find out is that entropy is actually an estimate of the minimum number of bits that each symbol has to have, on average, to represent the stream.
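Written out, the standard form of the information content and entropy being described here looks like this:

```latex
% Information content of a symbol s with probability P(s), measured in bits:
I(s) = -\log_2 P(s)

% Entropy: the probability-weighted sum of the information content of every
% symbol, i.e. the minimum average number of bits per symbol:
H = -\sum_i P_i \log_2 P_i
```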

Don’t worry, we’ll take a look at an example of that real quick. So let’s take a look at this in practice. Let’s say we have a stream that only has two symbols in it, whose probabilities are represented by P1 and P2. You can see here that when the probabilities are 0.5 and 0.5 between these two symbols, the entropy value actually gives us 1– basically saying that we need 1 bit per symbol to represent all symbols in this stream, right? So two symbols, we have 0 or 1. That’s pretty good. Now as we start skewing the probability, effectively one symbol starts becoming more dominant in the stream than the other symbol. So we start seeing the probability of P1 start rising and P2 start falling, and we can see that the entropy value ends up at about 0.88– roughly meaning that on average, we need 0.88 bits per symbol to represent all the symbols in the stream. And this is good, right? This is where we come back to variable-length codes. Some codes are bigger, some codes are shorter. To further this point, you can see that when we’re really skewed down here at 0.99 and 0.01, the entropy is as close to 0 as it can get, meaning that we’re going to see a lot of bias toward P1 and a large code for P2. Let’s look at this a bit more.

So let’s say we have something very simplistic here. We’ve got four symbols, right? And each one has an equal probability of 0.25. Now in this environment, the entropy is actually 2, so we have to have a minimum of 2 bits per symbol to represent all the symbols in this stream. Which actually boils down to something that looks like this– 00, 01, 10, and 11. This is pretty much the basis of how binary works. If you know the number of bits per symbol, you can read it pretty easily. Now, to look at entropy a little deeper– if we actually start changing the probability of the symbols in the stream, we get some different results. So let’s say we have a set-up here where we’ve got four symbols, and each one has a different probability: 0.49, 0.25, 0.25, and 0.01. When you actually run this through the entropy equation, you end up with about 1.57 bits per symbol. Again, this is on average, right? Now this means that we can start assigning variable-length codes to the stream to start achieving compression– because we really can only get compression once we have differing probabilities. So in this case A, being the most dominant symbol, actually gets the smallest code word of 1 bit. And over time, we’re going to end up with compression.

Now remember that entropy actually represents the best estimate of the minimum number of bits required to represent the stream. And of course, that also brings us to the next point: if you’re going to be using variable-length codes and get compression out of them, you need to make sure that you assign the shortest codes to the most frequent symbols in your stream. If you don’t follow this principle, you’re not going to get compression, and you’re really not going to have a good time. So now let’s take a look at how you actually use these variable-length codes in your software to start encoding and decoding your data stream.

So let’s say we start with our symbol stream A, B, C. And of course, we also have a symbol table. Now a symbol table is going to map your given symbols– A, B, and C– to your given variable-length code words, or VLCs. So A goes to 0, B goes to 10, and C of course goes to 11. Now encoding is actually extremely straightforward. All we have to do is read a symbol off the stream, figure out what code word it goes to, and then output that code word to the bit stream. So let’s see what that looks like. We take an A off the stream, and of course we find that A just goes to a single 0, so we can simply output a single 0 to the bit stream. Now of course we take B off the symbol stream, look through our symbol table, and find that B corresponds to 1 and 0, which is pretty easy for us– we again put out a 1 and then a 0 as well. And of course C, the final one here, goes to 1 1, which again is pretty easy to do. Now the encoding process here is actually not that difficult, as I said. We start with symbols, we move to code words, and we’re done.
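As a minimal sketch of that encode loop, using the A/B/C table from the walkthrough:

```python
# Encode a symbol stream with a variable-length code table.
# The table is the one from the walkthrough: A -> 0, B -> 10, C -> 11.
vlc_table = {"A": "0", "B": "10", "C": "11"}

def encode(symbols, table):
    # Look each symbol up and concatenate its code word onto the bit stream.
    return "".join(table[s] for s in symbols)

print(encode("ABC", vlc_table))  # -> 01011
```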

Now decoding is a little bit trickier, though. Because in decoding, we effectively have to start with our bit stream, and then read through and determine what symbols we’re working towards. Now here’s the tricky part: because our code words are actually variable length, we don’t know how many bits to read. Let’s take a look at what that looks like. So we start with our input stream, 0, 1, 0, 1, 1. Now we read a single bit off of the stream and check it against our symbol table. And we realize that the first symbol is actually an A, right? A is a single 0, we got a single 0 off the stream, so we can actually put an A in the output stream. Now we read the next bit off the stream, and it’s a 1. Here is where things get tricky. We don’t know if this is going to be a B or a C, because you can see that both of their code words start with a 1. This means we actually have to read another bit off of the stream before we can make a decision about which code this is. So we read a 1 and a 0, and we can see that it actually corresponds to the letter B. That means we read those off, and we output a B to the stream. Now we do this again for the next one. We read one bit again– because we always have to start with the shortest possible code– and we realize that it could be either B or C. We read in a second bit and use that pair, and we realize that we’ve just read the letter C off of the stream.
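Here is that decode loop as a sketch; it grows a bit buffer until the buffer matches a code word, which only works because this table is prefix-free:

```python
# Decode a bit stream against a prefix-free variable-length code table.
vlc_table = {"A": "0", "B": "10", "C": "11"}

def decode(bits, table):
    reverse = {code: sym for sym, code in table.items()}  # code word -> symbol
    out, buf = [], ""
    for bit in bits:
        buf += bit             # read one bit at a time
        if buf in reverse:     # a full code word matched: emit the symbol, reset
            out.append(reverse[buf])
            buf = ""
    assert buf == "", "stream ended in the middle of a code word"
    return "".join(out)

print(decode("01011", vlc_table))  # -> ABC
```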

Now this is all fine and dandy when things are working correctly. But the reality is that it’s really hard to design and create proper variable-length codes. In fact, a lot can go wrong in how you choose what code words are assigned to what symbols. And this is where things can actually go really wrong. So let’s say we end up here: we end up with the same input bit stream that we saw before, but notice that the code word for the symbol C has actually changed. It’s now 1, 0, 1 instead of 1, 1. So let’s start decoding and see what happens. We read the 0 off of the bit stream, and we notice that that’s an A. Fantastic. An A goes out to the stream, and all unicorns are happy. However, when we get to the next one, we read a 1, and we realize that that could be either B or C. Fantastic– still going normally. However, when we read the next 0, we’re still not clear if this is a B or a C. We can’t actually make a decision. It looks like a 1, 0 should go to a B, but this might actually be a C, because if we read the third bit, we could actually make that decision. So what we’re looking at here is, because of the way we’ve designed our VLCs, we don’t know if this 1 0 1 0 bit pattern is actually two Bs or a C and an A. This actually gives rise to a rule you need to follow when designing your variable-length codes, and that is known as the prefix property. Or more specifically: once a code has been assigned to a given symbol, no other code can start with that bit pattern. So if we have 1 0 assigned to B, we should not be able to start the code for C with 1 0. That’s going to make us have a really bad time.
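Here is a quick sketch of how you might sanity-check a code table for the prefix property; it catches the broken table above:

```python
# Check that no code word is a prefix of another (the prefix property).
# Sorting puts any prefix right next to a word it prefixes, so checking
# adjacent pairs is enough.
def is_prefix_free(codes):
    codes = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(codes, codes[1:]))

print(is_prefix_free(["0", "10", "11"]))   # True  -> unambiguously decodable
print(is_prefix_free(["0", "10", "101"]))  # False -> "10" is a prefix of "101"
```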

Of course, at this point you have to ask yourself: where do variable-length codes come from? I mean, they don’t just drop from the sky into your computer for your encoding purposes. For this, we actually have to introduce someone named Peter Elias. Now besides being a renowned tap dance fusion instructor, he also invented beta codes, or what we today call modern binary codes. Now as we talked about, binary codes actually aren’t variable-length codes, because they don’t obey the prefix property. But Dr. Elias didn’t stop there. You see, he actually constructed an entire fleet of compression codes that he used for various patterns.

Now, in the early days of creating these codes, you could actually take an integer value, such as 0 through 5, and use that to denote which generation of the code you’re going to have. So for instance, this is known as the unary code– probably the most simplistic variable-length code we have that obeys the prefix property. The way that unary codes are constructed is, given a specific integer n, you prepend n minus 1 ones in front of a 0. So for example, down here at 5, you can see we’ve got four 1’s and a 0. 4 gives us three 1’s and a 0, 3 gives us two 1’s, and so on. Of course, because of the way this is constructed, 0 actually doesn’t have a representation here. Now what’s important about the way that Dr. Elias constructed his codes was that they were all built around a specific probability distribution. So unary codes are actually optimal when all of your symbols follow the probability distribution P(n) = 2^-n. Now if your symbols don’t actually follow this probability, you’re going to get a little bit of bloat in your codes later on in the stream, and you’re not going to hit entropy.
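Here is that construction as a sketch, along with a check that the code length matches the information content when P(n) = 2^-n:

```python
import math

# Unary code: the integer n (n >= 1) becomes (n - 1) ones followed by a zero.
def unary(n):
    return "1" * (n - 1) + "0"

for n in range(1, 6):
    code = unary(n)
    # The code length is exactly n bits, matching the information content
    # -log2(P(n)) when P(n) = 2**-n, which is why unary hits entropy there.
    print(n, code, len(code), -math.log2(2.0 ** -n))  # e.g. 5 -> 11110, 5, 5.0
```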

But unary codes aren’t the only thing he created. Let’s see here. We’ve also got gamma codes, which you can see have a completely different construction process– that’s mainly because they have a different probability distribution down here at the bottom. As well as delta codes, a variant of that, which are constructed quite differently and have this monster of a probability distribution at the bottom. And of course we have the fabled omega code, which is known to many far and wide for having a probability distribution so complex that you can’t actually write it out. It’s kind of like saying Voldemort out loud. Anyhow. Back in the day, the way that you would choose which variable-length code to encode with is you would look at the probability of your set, and sort of determine how many symbols you would have, against which code was actually going to give you the optimal encoding for that set. So for example, if you had somewhere between 64 and 88 unique symbols that needed code words, you could find that, if the probability matched properly, you could use omega codes and get the best bit patterns. Of course, if you were lower than that, you might need to choose gammas, or somewhere in the middle, you might need deltas.
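The episode doesn’t walk through the gamma construction itself, but for reference, a common form of the Elias gamma code prefixes a number’s binary form with enough zeros to signal its length; a sketch:

```python
# Elias gamma code (construction shown here for reference, not from the episode):
# write floor(log2(n)) zeros, then the binary representation of n itself.
def elias_gamma(n):
    assert n >= 1, "gamma codes start at 1, like unary codes"
    binary = bin(n)[2:]                    # binary form of n, without the '0b' prefix
    return "0" * (len(binary) - 1) + binary

for n in [1, 2, 3, 4, 9]:
    print(n, elias_gamma(n))  # 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, 9 -> 0001001
```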

Of course, this brings us to an interesting question: if you have a set of symbols that don’t actually follow any of these known probability distribution functions, what do you do? Well of course, the issue with that– let’s see here– is that there are hundreds of variable-length codes out there. You see, we’ve got Schalkwijk codes, Phased-In codes, Punctured codes, Stout codes, Fibonacci codes, and really, there’s a lot to choose from. Or you could do something a little bit different and actually construct your VLCs based upon your probabilities directly. Let’s take a look at that.

In 1951, while David Huffman was still a student at MIT, he was taking an electrical engineering class where his professor gave him an opportunity: he could either write a term paper solving a very difficult problem, or he could study for his final. Of course David, being a little bit of an optimizer at the time, decided that he would rather solve a problem than study for his exams. Now the problem that the professor gave him was to find the most efficient way to represent letters, numbers, and other symbols using binary code. Of course, David decided to chomp into this full bore, and after months and months of chewing on the problem, he was about to give up. Effectively, he had reached the point of throwing away all of his notes when it hit him. He would later say that it was like lightning striking his brain as soon as the papers hit the trash can.

What he developed at that time would later be recognized by Donald Knuth as one of the most fundamental methods of computer science and data transmission that we’d ever discovered. See, what happened was David Huffman found a way to represent and create variable-length codes for any given probability distribution in a very simplistic and compact binary tree form. Now later he went on, of course, to figure out how to levitate things with his mind. But until then, he just used this code, which we are going to take a look at now. Now the brilliance of what Huffman discovered was actually the connection between assigning a symbol to a code word using the table. What he was able to do here was find that if he used a binary tree, he could effectively use the probabilities from the symbol table alongside the branches of the binary tree to create an optimal binary code. Well, let’s build a Huffman tree.

So typically what you’re given is a set of symbols with their probabilities. And the first step in the process is to sort them, such that the highest-probability symbol is at the front of the array and the least probable symbol is at the back of the array. From here, what we’re actually going to do is pop the two least probable symbols off the array and make them children of a new node that’s pushed in their place. We continue this process all the way down the line– comparing the two last elements of the array together, making them children of a new node, and pushing that node onto the array. Now at this point, we have a fully created binary tree. And to move from this tree to the generation of Huffman codes is pretty simplistic. All we have to do is assign a 0 and a 1 to the left and right branches accordingly. To generate the VLC for a given symbol, all we have to do is walk from that node up to the root, record all the 0’s and 1’s that we trace, and read them back from the root down. So for example, A moving from its node to the root gives us 0. B moving from its node to the root gives us 0 and then 1– read back from the root, that’s 1 0. And moving from C to the root gives us 1 1. This effectively is the generation of our variable-length codes using the Huffman algorithm. And now that is a Huffman tree.
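Here is that whole construction as a compact sketch. The probabilities are assumptions chosen to reproduce the A/B/C codes from the walkthrough, and heapq stands in for keeping the array sorted:

```python
import heapq

def huffman_codes(probs):
    # One leaf per symbol: (probability, tiebreak id, node). The id keeps
    # heapq from ever comparing the node payloads themselves.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Pop the two least probable nodes, make them children of a new node.
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next_id, (left, right)))
        next_id += 1
    # Walk the finished tree: left branches append 0, right branches append 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # degenerate single-symbol case
    walk(heap[0][2], "")
    return codes

# Probabilities assumed for illustration; A dominates, so it gets the 1-bit code.
print(huffman_codes({"A": 0.5, "B": 0.25, "C": 0.25}))  # {'A': '0', 'B': '10', 'C': '11'}
```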

It’s pretty easy to see that a little bit of compression can actually get you a long way, and with something as simple as statistical encoding, you can use variable-length codes to shrink your data set. Of course, the next thing you really want to talk about is what other compression algorithms are out there that you can apply to your data stream, before applying these variable-length codes, to actually get you below entropy itself. Of course, that’s a topic for a different show. My name is Colt McAnlis. Thanks for watching. [MUSIC PLAYING]