1   2   01 02 Structure of a Compiler 13m53s

1 2 01 02 Structure of a Compiler 13m53s


Welcome back, in this second half of the
lecture we’ll continue with our overview of the structure of a compiler. Recall
that a compiler has five major phases, lexical analysis, parsing, semantic
analysis, optimization, and code generation. And now we’re going to briefly
talk about each one, and we’re going to explain how a compiler understands these
with an analogy to how humans understand English. The first step at understanding a
program, both for a compiler and for a human, is to understand the words. Now,
humans can look at this example sentence and immediately recognize that there are
four words ‘this is a’ and ‘sentence. And this is so automatic that you don’t even
think about it but there is [inaudible] real computation going on here. You have
to recognize the separators, namely the blanks. And the punctuation, things like
the periods, and also clues like capital letters. And these help you to divide up
this group of letters into, a bunch of words that you can understand. And just to
emphasize that this is not completely trivial, let’s take a look at this
sentence. And you can read this, but it takes a little bit of time Because I’ve
put the separators in, in odd places. So you can see the word is, the word this,
the word a, and the word sentence. But again this isn’t something that comes to
you immediately. You actually have to do some work to see where the divisions lie
Because they’re not given to you in the way that you’re used to. The goal of
lexical analysis, then, is to divide the program text into its words, or what we
call in compiler speak, the tokens. So, here’s an, an example piece of program
text now, instead of a piece of English text, and let’s walk through this and
identify the tokens. So, there’s some obvious ones that are keywords, like if,
and then.>>And else that we want to identify. And then there are variable
names, things like X, and Y, and Z. There’s also constants, things like number
one, and the number two. And then there are some operators, double equals is one,
and the assignment operator is another. And here’s already an interesting
question. How do we know that double equals is not two individual equals signs?
How do we know that we want this? To be a double equal so we want, and not two
single equals. Well, we don’t know right now but we’ll talk about that.>>In the
lecture on how we implement Lexico analysis. But we’re not done with all the
tokens in this example either, there’s a few more. The semi colons, the punctuation
are also tokens, and then the separators are also tokens, so here’s a blank, that’s
a token, here’s another blank, that’s another token, and then there are lots of
blanks here that serve to separate things like the keywords and the variable names
and other symbols from each other. And those are the tokens of this example. So
for humans, once the words are understood, the next step is to understand the
structure of the sentence, and this is called parsing. And as we all learned in
elementary school, this means diagramming sentences, and these diagrams are trees,
and it’s a very simple procedure. Let’s look at this example. This line is a
longer sentence. The first step in parsing is to identify the role of each word in
the sentence. So we have things like nouns and verbs and adjectives. But then, the
actual work of parsing is to group these words together into higher level
constructs. So for example, this particular sentence consists of a subject,
a verb, and an object, okay? And that actually forms an entire sentence. So,
right here we have the root of the tree called a sentence, and that’s broken down
into constituent parts. The high level structure, as we said, is subject verb to
object. And then the subject is more complicated, as is the object. And this is
an example of parsing an English sentence. The analogy between parsing English text
and parsing program text is very strong. In fact, they’re exactly the same thing.
So here’s our little example piece of code again, so let’s work through parsing it.
So, clearly, this is an if then else statement, and so, the root of our
diagram, of our parse tree, is gonna be if then else. [inaudible] Nothing else
consists of three parts. There’s a predicate, a then statement, and an L
statement. And now let?s look at the predicate which consists of three pieces.
There’s a variable, a comparison operator and another variable and together those
form a relation. So the comparison between two things is one of the things you can
have as a valid predicate. Similarly, the then statement consists of an assignment
where Z gets one, and the else statement also has the form of an assignment, Z gets two.
And to, all together this is a parse tree, of the if-then-else, showing its
structure, breaking it up into its constituent pieces. Now, once we’ve
understood the sentence structure, the next step is to try to understand the
meaning, of what has been written. And, this is hard. So, actually we don’t know
how this works for humans still. We don’t understand, what happens after lexical
analysis and parsing. We do know that people do lexical analysis and parsing in
much the same way, that compilers lexically analyze and parse programs. But
frankly, understanding meaning is something that is simply too hard for
compilers. So, the first important thing to understand about, semantic analysis is
that compilers can only do very limited kinds of semantic analysis. And in
particular the kinds of things that compilers generally do are try to catch
inconsistencies. So, if the program is somehow self inconsistent, [inaudible]
compilers can often notice that and report errors. But they don’t really know what
the program is supposed to do. As an example of the kind of thing that we do in
semantic analysis, again, using an analogy in English, let’s consider the following
sentence. So, Jack said Jerry left his assignment at home. And the question is,
what, who does his, refer to here? It could be that his refers to Jerry, in
which case we would read, Jack said Jerry left Jerry’s assignment at home. Or it
could refer to Jack. In which case, we could read the sentence as, Jack said
Jerry left Jack’s assignment at home. And without more information, we actually
don’t know which one. His is referring to, whether it’s Jack, or it’s Jerry. And even
worse, let’s take a look at this sentence down here. Jack said, Jack left his
assignment at home. And the question is how many people are actually involved in
this sentence? It could be as many as three, there could be two separate Jacks
and his, could even refer to somebody completely different. We don’t know
without seeing the rest of the story. That surrounds this sentence, all the
possibilities for his. But it could also be as few as, only a single person. It
could be that Jack and Jack and his are all the same person in this sentence. And
so this kind of ambiguity is a real problem, in semantic analysis. And the
analogy in programming languages is variable bindings. So we would have
variables, in this case, a variable called Jack or maybe more than one variable
called Jack. And a programming language is going to have very strict rules to prevent
the kind of ambiguities we had in the English sentences on the previous slide.
So you know, in this example. Question is what value is printed by this output
statement, and the answer is it’s going to print four because this use of the
variable Jack binds to this definition here. And the outer definition is hidden.
So the outer definition is not active in this scope because it is hidden by the
inner definition and that is just a standard rule of a lot of lexically scoped
programming languages. Now the pilots perform many semantic texts besides
analyzing the variable bindings. And so here’s another example in English. So Jack
looked her homework at home. And, under the usual naming conventions, assuming
that Jack is male, we know there’s a type mismatch between Jack and her. So we know
that, whatever her is, it is not Jack. And, and therefore we known that this
sentence is talking about two different people. And so this is, analogous to type
checking. The fourth compiler phase, optimization, doesn’t have a very strong
counterpart in everyday English usage but it’s a little bit like editing. And, in
fact, it’s a lot like what professional editors do when they have to reduce the
length of an article to get it down to some word budget. So, for example, I have
this phrase right here, but a little bit like ending; and if I didn’t like it, if I
thought it was too long, I could replace the middle four words, with two words.
Akin to. So now it says, but akin to editing, and that means exactly the same
thing as the original phrase, but uses fewer words. And the goal in program
optimization Is to modify the program so that it uses less of some resource. Maybe
we want to use less time, we want the program to run faster maybe we want it to
use less space so that we can fit more data in memory. For a handheld device we
might be interested in reducing the amount of power that it uses. If we have external
communication we might be interested in reducing the number of network messages or
the number of database accesses. And there’s any number of resources that we
might want to improve other program’s use of. So here’s a simple example of the
kinds of optimizations a program might do. We can have a rule in our compiler that
says X equals Y times zero, is the same as X equals zero. And this seems like a real
improvement, because instead of doing the multiply, we can just do an assignment. So
we save some computation by doing that. Now unfortunately this is not a correct
rule. And this is one of the important things to know about compiling
optimization, is that it’s not always obvious when it’s legal to do certain
optimizations or not. Now it turns out That this particular rule is valid for
integers. Okay, so if X and Y are integers, then multiplying by zero is
always the same thing as just signing zero. But, it’s invalid for floating
point. And why is that, well because you have to know some details of the IEEE floating point standard, but there is a special number in the IEEE standard
called not a number and it turns out that not a number called a NaN times zero is
equal to not a number. Any particular non-number times zero is not equal to zero
If X and Y are plotting point numbers, you can’t do this optimization. In fact, if
you did this optimization, it would break certain very important algorithms that
rely on the proper propagation of not a number. Finally, the last
compiler phase is code generation, often referred to as Code Gen, and Code Gen, can
produce assembly code. That’s the most common thing that a compiler would
produce. But in general, it’s a translation into some other language. And
this is, entirely analogous to human translation. So just as a human translator
might translate, English into French, a compiler will translate a high level
program into assembly code. To wrap up, almost every compiler has the five phases
that we outlined. However, the proportions have changed a lot over the years, and if
we were to go back to FORTRAN I and look inside of that compiler, we would
probably see a size and complexity that looks something like this. We have a
fairly complex lexical analysis phase, an equally complicated parsing phase, a very
small semantic analysis phase, a. A fairly involved optimization phase and another
fairly involved code generation phase. And so we see a compiler where the complexity
was sp, spread fairly evenly throughout except for its semantic analysis which
is very weak in the early days. And today if we look at a modern compiler you’ll see
almost nothing in lengthening, very little in parsing, because we have extremely good
tools to help us write those two phases. We would see a fairly involved thematic
analysis phase. We would see a very large optimization phase, and this is in fact
the dominant component off all modern compilers, and the a small code-generation
phase because again we understand that phase very, very well. That’s it for this
lecture. Future lectures, we’ll look at each of these phases in detail.

Leave a Reply

Your email address will not be published. Required fields are marked *