Compiling C to printable x86, to make an executable research paper


Hi, I’m Tom 7. Today I want to talk to you about a research
paper, that is, at the same time, a computer program. What does that mean? Well, just because a file says, in its name,
that it’s an EXE, or executable program, doesn’t mean that it necessarily is. It also doesn’t mean that it isn’t both an
executable program and something else at the same time. Like, a text file, for example, containing
an interesting research paper. Now, I didn’t just find this research paper
— which is also a program — I made it, for SIGBOVIK this year. So I want to talk about how I did that, because
I think it’s interesting. We’re going to learn about compilers, we’re
going to learn about some bits, and some low-level assembly, and hopefully have some fun. Um, let’s get started! (guitar music) Every file on the computer is just a bunch
of bytes. And we’ll talk more about bytes in a second
because there’s a particular thing about them that’s important for this work. But I’m not going to belabor it, because first
of all, I know there are a lot of people that already know everything about bytes. And then there are many people watching that
don’t live and breathe this stuff. Both will probably be bored by the details. But also, the details are in the paper. A computer program is identified by a specific
pattern of bytes. And the important thing is that it begins
with a header, and we’ll talk about the header in a little bit, and then it contains some
data, and it contains some code. And all that it means to be a program is that
it looks like this. Okay? And then of course it should work; it should
have some kind of functionality. The functionality is provided by this code. You may know about source code. Like something written in the C programming
language, or a Python script. That’s not what I’m talking about here. This kind of code is represented as bytes,
and it is the native instruction set — the native language — of the computer. The chip actually understands this part of
the program directly. Okay? In fact the way it does that is: It’s going
to just load this file into memory. And then it’s going to say — and roughly
speaking, it just puts the file somewhere in memory. And then it’s going to know from something
in the header that it should sort of start here. And this is called the instruction pointer. The thing that says where we’re currently
executing. Then it’s going to load up some instruction,
from the bytes in the code section, and it’s going to execute that. Then it’s going to go to the next instruction. The computer, when it looks inside a program
and starts running it, interprets a byte as an instruction and it knows most of the bytes
to mean something. The byte 40 equals “INC AX,” where AX is some
temporary inside the processor called a register, and it contains a number and it basically
says: Increase that number by one. Increment. Some instructions are not just one byte. For example, 6A followed by 32 means “PUSH”
the value 32 (some byte) onto the stack. And the stack is just some other thing that
the processor has in it. Bytes also have meaning–there’s a standard
way to interpret them–as characters in a text file. And when you look at them that way, 40 means
the @ sign, 6A is a lowercase J; 32 is the digit 2. Now only some bytes are considered printable. And those are the ones we’re going to be interested
in for this paper. Printable means that there’s an agreed-upon
character–basically something that appears on your keyboard–that everyone on all computer
systems agrees: That byte means that thing. Now the printable bytes start here, with a
space, which I’m going to draw this way. We’ve got an exclamation mark; quotes; “hashtag”. So here we have the printable subset of bytes. Outside of this region is a bunch of hooey. This character for example deletes. This character beeps. And up here you could put some of the characters
together to create emoji like the pile of poo. If you look inside almost any program, you’ll
see a combination of all these characters; you’ll see printable ones but you’ll see also
beeping, and deleting characters, and maybe even an emoji. Here’s an example of what that looks like. Total gibberish. Okay, now I think I can really explain what
I did. So we’re going to take a program, written
in the C programming language, to be kind of old school. And here I’m talking about source code. So this is a thing that a normal programmer
would write. Sort of. And feed it to a compiler, called ABC, which
I also wrote. It’s particular to this project. And that is going to output another file,
an executable file, which means it needs to have a header and data and code, but every
byte in this entire file will be a printable byte, so I can only use those 95 bytes. This also is not to scale. This paper ends up having to be really really
big. About 420 kilobytes. The reason this file needs to be so large
is that the header, which will describe sort of where everything is, and how big it is,
needs to be made of printable bytes, so we’re not able to use those small bytes. And if you know anything about this problem,
I’ll mention right now that I do this in “expert mode.” What I mean by that is that the code will
not be self-modifying. So it will only execute instructions that
are within the printable subset. This beast here is Intel’s manual, volume
2, instruction set reference, for the x86 series of processors. And pretty much everything that your computer
can do is by way of the instructions that are in this book. So here I’ve opened it up to a random page;
this is the cosine routine on floating point numbers. And the restriction to printable bytes in
our project today is going to leave us with about this much of the book left. A very small fraction of all instructions
available to us. So here are all the instructions in the printable
subset of x86. The blue is the opcode byte; the black is
the name of the instruction, which is what you’d find in the book, or programmers often
think of the instructions by these names; and the red means that it takes some argument,
that describes for example in subtraction, which two things to subtract. Now those red argument bytes also have to
be printable, and that limits what we can do. Suffice to say, it’s complicated, but this
table in the book shows us that only one small segment of all addressing modes are available. And in particular I can’t just subtract one
register from another. I always have to involve exactly one register
and exactly one weird memory indirection. Unfortunately we’re missing some really important
instructions that are in most programs. For example, we don’t have the MOV instruction
which just moves data from one place to another. And we also don’t have the interrupt instruction,
which is how you’d normally communicate with the operating system to do things like print
things to the screen, or even exit the program, which is going to turn out to be a difficult
issue. And although we have a family of jump instructions,
which is important, because that’s how we’re going to change the instruction pointer around
to go to different instructions and have some kind of loops in our program, the argument
that the jump takes needs to be a printable byte, and this is going to mean that we only
can make jumps that are from 32 to 127 bytes forward. If you can only jump forward, the instruction
pointer will always just go further and further down the program, and you won’t ever be able
to make a loop, and that makes programming very very hard. So, compiling to the printable subset of instructions,
with printable arguments, requires solving some puzzles. And it’s kind of a miracle that they can even
be solved; sometimes just barely! I think it’s kind of cool, so I want to walk
you through some of those, just to give you a flavor of what it’s like, and maybe learn
something. And the first one is just the computation
of the bitwise OR function. The OR function operates on bits. Let’s suppose that we have two numbers: A
and B. “A OR B” means to look at every pair of bits, and if A contains a 1 or B contains
a 1, or both, then the output is 1. But if both of them are 0, then the output
is zero. So, in this case–which covers all of the
possibilities, right?–the output is 1110. Now we don’t have this instruction, but we
can simulate it using some instructions we do have. One of those is AND. For AND, the output is 1 only if both A and
B have a 1 in that position. Otherwise it’s 0. Another instruction we have is XOR, or “exclusive
or.” For exclusive or, the output is 1 if exactly
one of the two bits is 1; 0 if both are 1, or if both are 0. Now observe the relationship between these
two things and this output that we want to achieve. We have ones in all of the right positions,
so in fact if we could OR these two things together, we’d be done. But of course we don’t have OR. However, we now have the case that there’s
at most one in any column; either 10, 01, or 00. Which means that we could add these two together,
and we’ll never need to carry, and we’ll get the OR that we desire. This basically relies on the fact that PLUS
is the same as OR if you never have two 1 bits in the same position because then you’d
need to carry that to the next position, just like in decimal arithmetic. Now, we don’t have a PLUS operation, but we
can implement PLUS using subtraction. And basically what we do is subtract the negative
of a number in order to simulate plus. In two’s complement binary, the negative of
a number is obtained by flipping all of the bits, 0 to 1 and 1 to 0, and then adding 1. We can add 1 with the increment instruction,
which we already know is printable, and we can flip all of the bits by doing an exclusive
or operation with a constant that’s just all 1s. So by doing tricks like this, we can decompose
operations that we don’t have, like OR, into instructions that we do, like AND and XOR
and increment. And this is really the essence of compilation. This is what all compilers do when they translate
source code, which contains operations that the computer can’t do directly, through a
variety of transformations into much simpler code, but longer code, that’s phrased in terms
of instructions that the computer can perform. So in the ABC compiler we first take in the
C programming language, and parse it to produce an abstract syntax tree language. Then we do something called elaboration to
get rid of some of the high level constructs like “for” loops and turn those into GOTOs. That results in the C Intermediate Language,
which is bigger but simpler code. Then we convert that to a low-level language,
LLVMNOP (ha ha), where we make the layout of data explicit, and then finally perform
some more transformations to larger but simpler code, into the printable x86 language, and
then output the paper. We also perform some optimizations within
a language itself, optimizing CIL to itself, and performing temporary allocation within
the LLVMNOP language, to itself. ABC is something like 13,000 lines of Standard
ML, and a lot of this is normal for a compiler, despite the fact that we’re solving some abnormal
problems because of the output language. Now transforming complex things into simple
things does require good simple building blocks. And one that I’ve already relied on is the
ability to load a particular constant value, like 1111, into a register. Although most programmers take this completely
for granted, this is surprisingly hard in the printable subset, because for one thing
any byte that’s not printable can’t just be loaded or added into a register. So now let’s talk about the apparently simple
task of loading a particular value into a register. Normally we would just load a value into a
register using the MOV instruction. This doesn’t work because the MOV instruction
isn’t printable, and if the value we want to load isn’t printable, that won’t work as
well. The first observation is that we can at least
make this register have some known value. We do that with the AND instruction. This byte, which is space, has only one 1
bit in binary. That means if the A register contains something,
and we don’t know what yet, and we AND it with that value, all of these values will
become 0, because they’re ANDed with 0. Only this one slot will remain possibly 1. Then if we perform another AND, but with the
value 40–which also has one 1, but in a different position–that’s guaranteed to turn off this
one remaining bit that might still be 1. After doing these two ANDs, we know that A
contains the value 0. Now what I’ll do is build a big matrix. And the matrix will tell me how to get from
some byte that I have in a register (on the left here), to some byte that I want (which
I have on the top here). And each cell will give me a series of instructions
to execute to transform that value into the desired value. Now when the input equals the output–that’s
the whole diagonal here–I don’t need to execute any instructions. So I can start by just filling in the diagonal
with the empty instruction list, which I’ll write as a dash. And this matrix actually has 256 values on
this side and 256 values on this side; I’m not going to be able to fit it on the paper. Now we know that we have the increment and
decrement instructions, which means that adjacent to this diagonal, I can get from 3 to 4 by
just executing “increment AX”. This will be true for this entire thing, and
similarly decrement. Now in fact I can get to any number from 2
by just incrementing 2 over and over again, so I’ll initialize this entire matrix with
repeated applications of increment. This one will have three copies; this one
four; and similarly for decrement. That demonstrates that I can get from any
number to any other number through a very long series of increments or decrements. That totally sucks, but it gets us started. Now the next step is to look at this matrix
and explore possibilities for shortcuts. For example, if I have 6 in the register,
and I apply AND with say, 20, because 6 and 20 don’t share any bits, that will give me
zero. So I’m able to replace whatever was in here,
which was DEC DEC DEC DEC DEC, with just AND 20. In fact most of this column I’ll be able to
fill with AND 20 or AND 40. Now the neat thing is, that because this gives
me a plan for getting from any number to any other number, if I’m trying to figure out
how to get from 6 to 1, for example, one of the things I can do is say: Well, what if
I go from 6 to 0? That’s just one instruction. And then I go from 0 to 1. That’s just one more instruction. And what that gives me is that this cell,
which used to be “decrement decrement decrement decrement decrement”, can now be AND 20, yielding
0, and then increment one time, yielding 1. In addition to the AND instruction I can also
have SUB, and XOR, and some other funny tricks I can play in here. So I can take this matrix and I can simplify
it by applying shortcuts, and then composing operations where I go from one value to another
value, and then use that value’s shortcut to where I’m trying to get, over and over
and over again, which makes the matrix better and better by making these cells faster and
faster. And I can just keep doing that until I get
to a minimal matrix. And it turns out it takes no more than four
instruction bytes to get from any one value to any other value, which is pretty good. What I really like about this is how beautiful
the matrix looks when you just think about how many instructions it takes to get from
one value to another and plot that in a graphic. Here’s what that looks like. This kind of fractal structure shows up a
lot in computer science, and mathematics, and Hyrule. For example a paper I wrote a few years ago
about drinking games had a similar kind of pattern; I’ll just show you what that looks
like in three dimensions. This is only for eight-bit bytes, but you
can generalize this to sixteen-bit words, and you can play tricks with the stack in
order to load each individual byte into a word, and that all works out. Okay, two more puzzles and then we can execute
this paper. I mentioned before about the jump problem,
which is that all jumps that we have access to will jump forward in the program. Jumping forward is great, but we also need
to be able to jump backwards, in order to create loops. For example, in the C programming language,
a “while” loop just keeps executing the same piece of code until some condition becomes
false. At first blush this seems impossible, because
if I’m only jumping forward, the instruction pointer’s going to only ever get bigger. But there is one condition under which the
instruction pointer gets smaller. The instruction pointer is the thing that
tells us where in the current program we’re executing. And this is a thirty-two bit number; that’s
a lot of bits. That means it can represent a number up to
2^32, which is approximately 4.2 billion different positions in the file. And our file is not 4.2 billion; it’s only
420kb, and in fact the usable code section is only 64 kilobytes large. If this is our program, the instruction point
starts here and it jumps forward a few times; we’ll get to the end. If I execute past the end of my program, I
will just keep going into memory, into whatever’s there, executing all sorts of weird instructions
and probably doing stuff that I don’t want. FOR EXAMPLE, executing non-printable instructions! So this is no good. But there is a strange fact, which is that
if we’re only using the 16-bit portion of the instruction pointer, that is, if these
are all 0s, and we issue a jump that happens to cross the boundary here (this boundary
is address FFFF, that is, the largest 16-bit word, all 1s in binary), then execution will
wrap around. It won’t continue into the 32-bit portion,
and it goes back to the beginning, modulo 2^16. This is kind of like when your odometer rolls
over, or when you shoot so many ducks in Duck Hunt that your score goes back to zero. And weirdly it only happens when your instruction
pointer is already in the 16-bit range. Which means we can use this trick but only
if our code stays within the 16-bit region. This is where that’s documented in the manual. I should say you shouldn’t necessarily trust
this manual; it has a bunch of typos, like this is a mistake right here. Really Intel? Backslash? So our whole program will fit within an address
space, where the beginning is 0000 and the very last instruction byte is FFFF. Of course not to scale. And the program will consist of blocks. Blocks have to be of a certain size. They can’t be too small and they can’t be
too large, because what we’re going to do is jump between those blocks and every time
we ever want to make a jump, we’re always going to jump to the next block. And we have to be careful to put some jump
at the very end of this program, so that it jumps, which wraps around, and brings us back
to the start–that’s the whole point–but if we do that, then from any place we can
technically reach anywhere else we want to go by doing a series of jumps. This series of blocks I call the “ladder.” Each block starts with what I call a “rung”,
because, you know, ladders, and the rung always decrements a register called SI, and then
says “jump not zero” to the next block. So basically what will happen is if I want
to jump a certain number of blocks, I will load SI with that number, so if I want to
go 5 I’ll start with 5, and then I’ll jump here, decrement SI so that it’ll say 4, and
then I’ll jump not zero which means, 4 is not zero, so I’ll jump to the next one and
then it’ll become 3, and then 2 and then 1 and maybe wrap around. When it finally becomes 0, I will not jump
and I’ll instead enter the block and execute it. We can also put jumps inside this block. So there could be a “jump if greater than”
some value, and this always targets the next block. So if we’re executing one of these jumps I’m
going to load SI with the right value. This is very reliable but it’s an extreme
pain in the butt to lay them out, because no block can be too long and no block can
be two short, because I can’t even jump 10 bytes; I have to jump at least 32. Moreover, this whole process has a dependency
on itself because I need to know how far to jump, I need that to be printable, but I don’t
yet know how big these blocks will be. And unlike a normal assembling process, even
a basic thing like loading a value into a register–for example how many blocks to jump–depends
on the specific value being loaded. (Because, recall that table that we just computed.) So I have to do this in an iterative loop. I have to do this over and over again, laying
it out, hoping that it will work, and then making minute adjustments until it does. The other bummer about this is any backwards
jump has to go through the entire program, and so some operations, like multiplication,
which are implemented with loops because I don’t have the multiplication operation, take
time that’s proportional to the size of the program instead of just being a single operation. The last trick to talk about is communicating
with the operating system. And basically the story is: We can’t do it. This is normally done via something called
an interrupt, which is the instruction whose opcode is CD. And the most important interrupt to know is
INT 21. In DOS, this interrupt basically says “Hey! Check this out!” You load up the registers with something you
want to do, DOS takes over, and it does something like write to the screen or open a file. This byte CD is not printable. Now there are some close calls that I describe
in the paper, that almost allow us to execute interrupts. But one trick that I did find out involves,
the way that interrupts themselves actually work, which is kind of neat. This is a very low-level concept in the processor,
and as a result, whenever an interrupt happens, what the processor actually does is it looks
inside memory, and it looks inside a specific part of memory, which is the very beginning. And based on the interrupt number, it’s going
to pick one of these slots, and it’s going to read out of that slot an address that will
be the new instruction pointer. Since each one of these is four bytes long–32
bits long–the INT 21 handler will be at position 0084 at the beginning of memory. This thing is called the interrupt vector
table, because it gives you a table for each interrupt of where to go. Now what’s useful about this is that some
interrupts aren’t caused by the interrupt instruction. For example the timer interrupt is this one,
and this interrupt just automatically happens because of the system clock every few milliseconds. So we could put something in here, like for
example the interrupt 21 handler, by just copying this into wherever the timer interrupt
handler is, and then the computer will periodically execute the interrupt 21 handler. That’s not very useful for us, because we
don’t want this to just keep happening every few milliseconds, and there are some technical
reasons why it won’t even work. But there’s one other way that interrupts
happen, and that’s when we execute an illegal instruction. That’s interrupt number 6. We can control when an illegal instruction
happens as long as we have one available to us, by just trying to execute one. So what we do, at program startup, using some
hand-written assembly that’s only printable, is copy the address from the INT 21 handler–that’s
DOS’s system calls–over top of the illegal instruction handler. Now if we can execute an illegal instruction,
it will execute the same code that would happen if we called INT 21 normally. And do we have an illegal instruction? Yes we do! This one here is Adjust RPL Field of Segment
Selector, which is completely useless in user programs, because it’s always illegal, and
what I love about this instruction is this is a lowercase letter C, which means–this
actually takes an argument–we can encode this illegal instruction as lowercase letter
C, lowercase letter Y, lowercase letter A, and just an exclamation mark for emphasis. This turns out to be this oddity, which is
illegal. Now for technical reasons we can actually
only use this trick one time before it stops working. So what I’ll use it for is to invoke interrupt
21 with AH equals 4C, which means “exit the program.” And that’s okay, because we only exit the
program once. YOEO. And this is why “cya!” is such an appropriate
encoding. Let’s actually look inside the paper for a
second. Most of this paper is text, and it’s just
text that’s part of the data, or header even, of the executable. There’s some drawings in there too. Once you get far enough along, we find the
code. It’s not something that you can read, but
it’s all made of keyboard characters. And if we look at the end here it says “LX4
cya!” and that is where the program will be executing when it exits. Sorry, I know this video’s getting pretty
long, but speaking of boring, a program that doesn’t do anything but loop and then exit
is not very interesting. And since we’ve used up our trick to contact
the operating system, which is how you’d normally do input and output, we need another way to
do it. Fortunately there is a low-level way for us
to do that. This is a pinout of the Intel 8086, and the
Intel processors, like most microprocessors, have a way to communicate with peripherals
that are on the motherboard, by using something called I/O ports. An I/O port is pretty simple. Normally you’re reading and writing from the
main memory of your computer, which is what we’ve been talking about the entire time when
we’re talking about addresses. We use these data pins here in order to specify
the address, and we also read or write the data on those same pins. This pin over here allows us to switch that
behavior to instead read and write from peripheral addresses, which basically instructs these
pins to refer to a different part of the motherboard. That’s basically all that I/O ports are, but
it’s a very low-level concept in the processor. Now it just so happens that we do have a pair
of slightly awkward instructions for reading and writing the I/O ports. And we’ll use these to communicate with some
peripheral. So this paper doesn’t print anything out,
but what it does do, and I’m about to show it to you, is play music playing this ancient
FM synthesis sound card called the AdLib. Let’s go take a listen. Okay, here we are, finally, in computer world. I’m using a DOS emulator called DOSBox, but
this also works perfectly fine in real DOS, as long as you have the right peripherals. I wanted to show you that the paper is all
printable bytes, but unfortunately it’s too big to even load into the editor. But we can run it, and what this program does,
is it takes on the command line a description of music in a format called ABC. I didn’t invent this format. So if I say “ABC” — (music plays) — It’ll
play those notes and then exit. It accepts more complicated input, including
the possibility of playing more than one track at the same time. (Polyphonic music plays.) Now I’ll just leave you with the built-in
tune, which plays if you don’t provide it with any music of your own devising. Perfect music. Here we are. (Scratchy FM version of Rick Astley’s “Never
Gonna Give You Up”)

Leave a Reply

Your email address will not be published. Required fields are marked *