Compilers Lecture 12: Parsing (1): Context-Free Grammars

>>And parsing will be based
on context free grammars. So in today’s lecture we will
be basically reviewing context free grammars. So the lecture will be about reviewing not introducing
context free grammars. So I assume that you already
know context free grammars and you have seen them in 135 and you have written
many grammars but we are reviewing them now
and then we will learn how to use these grammars
to build a parser, so. So in the first stage
of the compiler, the compiler will do scanning and the scanner will divide the
input string which is a program. By the way, your program
will be viewed by the scanner as a single string no matter how big your program is or how many lines of code you have. So if you have something
like X234 equals 579 and the scanner divides
this into words or tokens; so this is the, this
will be identifier. This will be a number. By the way in this when
we talk about parsing, there will be little
advantage in doing both integer and floating-point numbers. So we’ll just focus on integers. So there will not be any actual
benefit in doing both integers and floating-point
or real numbers. It will be the same concept
from parsing point of view. From scanning point of view
[inaudible] and we know how to write rules for integers and how to
write rules for floating points. But from parsing point of
view, a number is a number. So in our treatment of
parsing, we’ll just say a number and we will limit
ourselves to integers but it’s not very limited because floating points
will be treated the same way by the parser. And this is our assignment. And this is our semi
colon or end of statement. Now what’s interesting in the
compiler is that these parts of speech that the
scanner identifies, these will become the
symbols of the alphabet that the parser deals with. So our alphabet at the parsing
level, the alphabet is going to consist of an identifier. So this whole identifier is
just one symbol of the alphabet and this comes from the scanner. A number is another
symbol of the alphabet. You know the equal is another
symbol of the alphabet plus, minus, multiply, if, else. So from parsing point of view, each one of these
would be viewed as a single symbol
of the alphabet. So this is one symbol. This is one symbol
in the alphabet. So there are two levels. There is the scanning level
where we do the micro syntax and we identify the
words of the language and there is the parsing level
and at the parsing level, these words become individual
symbols in that language that we will be trying to recognize using
context free grammars. Of course you know, some
of them will be just individual characters. So this symbol is just
an individual character. But this from parsing point
of view, this will be treated as a single symbol and this
symbol is the keyword else. So the scanner will
return to the parser. Basically, the way
this works is, the parser will call
the scanner. The parser calls the scanner and the scanner will
return a part of speech. So it will return you know,
identifier, number whatever, and it will return also the
actual lexeme or a clearer way of doing this is get next word. So the parser does
get next word. It’s calling the scanner to get
next word in the input stream. So with this example, the first
time this get next word will return an identifier. It will say, okay the
first word is an identifier and this is an actual lexeme. Then when the parser calls
get next word the second time, it will return this. It will tell it the next word
is the assignment operator and the third word is a number. Okay. Yes?>>I have two questions. So the scanner doesn’t
operate on the entire input and then just pass it
as a list to the parser. It actually gets
called for each word.>>Yes.>>Okay.>>Well in principle, you can
do it both ways but in practice, yes, it will return
one word at a time.>>Okay.>>The scanner returns
one word at a time.>>Okay. The second question is, when the scanner encounters
something like a number, does it do the conversion into
an actual number format or is that the job of the parser?>>Okay so all that the scanner
does is that it identifies this as a syntactic category. It says the syntactic
category number and then it will give the parser
the actual scheme, the lexeme, I’m sorry, which is 579, okay. So this is the lexeme and then
the parser deals with this. In fact, doing the actual
conversion is neither scanning nor parsing. Scanning is about identifying
words and parsing is about identifying
sentences or statements. But yeah, once this has
been identified as a number, we can do the conversion. It can be done, you know, in scanning or in parsing, but in principle it does not belong to either scanning or parsing, conceptually. Okay.
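To make the scanner-parser interface concrete, here is a minimal Python sketch of the idea. The name get_next_word follows the lecture, but the token categories and regular expressions are illustrative assumptions, not code from the course.

    import re

    # Illustrative token categories; a real scanner has many more
    # (keywords, operators, etc.), and only integers are handled, as discussed.
    TOKEN_SPEC = [
        ("NUMBER", re.compile(r"\d+")),
        ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
        ("ASSIGN", re.compile(r"=")),
        ("SEMI",   re.compile(r";")),
    ]

    def get_next_word(text, pos):
        """Return (category, lexeme, next_pos), or ("EOF", "", pos) at the end."""
        while pos < len(text) and text[pos].isspace():   # skip whitespace
            pos += 1
        if pos == len(text):
            return "EOF", "", pos
        for category, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                return category, m.group(0), m.end()
        raise SyntaxError("unexpected character: " + text[pos])

    # The parser calls the scanner one word at a time:
    #   get_next_word("x234 = 579;", 0)  ->  ("IDENT", "x234", 4)
    #   then ASSIGN ("="), then NUMBER ("579"), then SEMI (";").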
So now this is going to be our alphabet, but in today's lecture we will focus on a simple alphabet, the basic alphabet that you have probably seen in 135, which is the alphabet {a, b}: the alphabet that consists of two characters, a and b, only, to review the concept of context free grammars. So in today's lecture we'll focus on this alphabet, but in subsequent lectures this is what we will be focusing on; this is what our alphabet will be. Okay. So you know, a
context free grammar, how would you describe
a grammar? Since this is a review, since
we are reviewing grammars, how would you describe
it or define it? Yeah.>>It’s supposed to have
rules of the language.>>Yeah. It’s a set of rules. Basically a grammar is a set of
rules for defining a language that recognize or that
generates a certain language and the language,
a formal language, when we say language
here in this context, it’s just a set of strings. Formal language is
a set of strings. So it’s a set of rules that
generate all the strings that belong to a given language. A set of rules that
generate all the strings that belong to a given language. Now that language and again, I’m
talking about formal languages, if that language is regular
then we don’t need grammars. We can recognize that language
using a finite automaton or a regular expression. In fact, we can use grammars, because grammars are more general, and the special type of grammars that generate regular languages, we call them, what do we call the grammars that generate regular languages?>>Regular grammars.>>Regular grammars. Exactly. So there is a subset of grammars, a subset of context free grammars; in fact, context free grammars are also a subset of the bigger set of grammars, the context sensitive grammars. But a subset of context free grammars is the set of regular grammars that recognize regular languages. Now here we focus on those
languages that are non-regular. That cannot be recognized
by finite automata and regular expressions. And maybe the classical example of a non-regular
language is this language, the set of strings, or in fact I'll write it this way: the language a^n b^n (of course, like I said, in today's lecture we'll focus on this alphabet {a, b}), such that n is greater than or equal to 0. So this is an example of a non-regular language. It's a language that cannot be recognized by automata and regular expressions but can be generated using a context free grammar. And the context free grammar for this language is S. So does anyone remember this? This is probably the first example that you have seen in context free grammars. S is aSb, or –>>Empty.>>– or Epsilon. So we do need it. So this is a context
free grammar. Now how many rules or productions does this grammar have?>>One.>>Two. It has two. So this is one production. In fact, you know, these are two productions that are merged together. We could have written this as, you know, S is aSb, or S is Epsilon. So we could have written this as two separate rules or productions. So each one of these is called a rule or a production, but usually it's more convenient to write it this way. If there are multiple rules that define the same symbol, we can do them this way. Where this bar stands for what?>>Or.>>Or. So this stands for or. So it's like S is either aSb or Epsilon. Okay? We can write it this way
or we can write it this way. Okay. Now in this case,
you know a grammar, what do we call the S?>>Non terminal?>>Okay, it’s the non-terminal
but it’s a special non-terminal.>>The start.>>The start. Yeah. So the grammar has, I think we can, have
to do this now. So a grammar consists of V, T, P, and S. So it's a quadruple that consists of four components. The V is the set of variables; set of variables, or there's another name for variables in this context: variables are non-terminals. So variables are non-terminals. Non-terminals. And T is a set of –>>Terminals.>>– terminals. P –>>The productions.>>– set of productions.>>Yeah. And S, there is always one distinct variable.>>The start variable.>>The start variable. So in this case, in this example, we only have one variable, which is S. We have two terminals, a and b, and S, the only variable that we have, happens to be the start variable. And we have two productions that are written in this way. Okay? So this is an example of a grammar. We'll see more grammar examples in this lecture.
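To connect the definition to something concrete, here is the quadruple for the a^n b^n grammar written out in Python; this encoding is just an illustration of the definition, not a standard library representation.

    # G = (V, T, P, S) for the grammar S -> aSb | Epsilon
    V = {"S"}                          # variables (non-terminals)
    T = {"a", "b"}                     # terminals
    P = {"S": [["a", "S", "b"], []]}   # productions; the empty list is Epsilon
    S = "S"                            # the start variable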
Now the set of terminals, you can also view it as the alphabet of the grammar. It's also the set of symbols that can appear in a string that belongs to this language, or a string that derives from this grammar or is generated by this grammar. Now in the context of programming languages, our sigma, like I said, will not be {a, b}; sigma will be identifier, number, the different keywords, the different operators. Okay, so now let's look at this. Let's do what we call derivations. So let's do a derivation for aaabbb. So how do we derive it? We start with the start
symbol or start variable. Then in each step, we
make a substitution. So in order to derive this
string from this grammar, the first substitution is
going to be aSb and of course, you know, terminals, you cannot substitute for terminals. By the way, here I'm following the convention of writing terminals in lowercase and variables in uppercase, so this is the convention that I'm using. Terminals are in lowercase. Variables are in uppercase. So in our convention, here the terminals are lowercase and the variables are uppercase. There are other conventions. You know, some people use, some
people underline terminals. So there is another
convention for writing grammars where people underline
terminals. So in this case we
substitute aSb for S, so at any step you
can substitute for S, either this or this. There are two possible
substitutions because you have two
possible rules or productions. And for this S we
also substitute aSb and we do it again, aSb
and in the final step we
just derived is this string. AAABBB. So the string
that gets derived consists of the leaves of the tree. So this is what we
call derivation tree. This is the derivation
tree and the string that gets derived would
have been on the leaves of this derivation tree
and we read it in order, left to right. So this is basically
going to be, you know this is the first
symbol in the string. This is the second, third,
fourth, fifth, sixth. Okay? So this is the string
that we have just derived. So this is a derivation tree and
remember, you know in compilers, our goal is to read a
program written in a source, some high level language and construct what you
call the abstract syntax. Now the abstract syntax
tree is going to come from this derivation tree or
parse tree and the difference between the abstract syntax
tree and the parse tree is that in the abstract syntax
tree, we only keep the nodes that correspond to actual
operations and we get rid of the nodes that correspond
to grammatical symbols. You know we don’t care about
symbols in the grammar. We care about nodes
that correspond to actual operations you
know, such as add or multiply; operations that will get
translated into actual code because we are building
the abstract syntax tree in order to generate code. We will get to that. So don’t have to fully
understand the difference between derivation tree and the
abstract syntax tree in fact, until we are done with parsing. And parsing, like I said,
we’ll take three weeks. So this week we’ll basically
learn how to write grammars for actual programming
language constructs and the second week we will look
into one approach to parsing, which is top down parsing,
and the third week, we will look at bottom-up
parsing. So there are two
approaches to parsing: top down and bottom-up. So each one of them will
take probably a week. Okay. So this is an example. By the way, you can also do the
derivation in a textual format, so you can do the derivation like this. S, then you substitute for S aSb, then you substitute for this S another aSb, so this also derives aaSbb. So what you did here is that you substituted for this S; you substituted for this based on this rule, because I have a rule here that says I can substitute aSb for S. Then there is a third step where we do aaaSbbb, and again I have now substituted for this S. And the last step is substituting an Epsilon for the S, so I end up with aaabbb. Usually we don't write the Epsilon. You can write the Epsilon, but usually we don't write it. And this leads to the string that we are trying to derive. Okay.
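As a quick sanity check of this textual derivation, here is an illustrative Python function (my own sketch, not from the lecture) that applies the two productions of S -> aSb | Epsilon and records every sentential form:

    def derive_anbn(n):
        """Return the sentential forms in the derivation of a^n b^n."""
        form = "S"
        steps = [form]
        for _ in range(n):
            form = form.replace("S", "aSb", 1)   # apply S -> aSb
            steps.append(form)
        form = form.replace("S", "", 1)          # apply S -> Epsilon
        steps.append(form)
        return steps

    # derive_anbn(3) -> ['S', 'aSb', 'aaSbb', 'aaaSbbb', 'aaabbb']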
So let's look at another example now. How about a^n b^m, where n is greater than or equal to 0 and m is greater than n? Now we want a grammar for this language. Okay. So a grammar for this language, the language of those a's and b's, of course a's followed by b's, where the number of b's is greater than the number of a's. In fact, there are two
ways of thinking of this. So you can think of it this way: aabbbb. So you can think of it
as this is a string that belongs to the language of
equal A’s and B’s and then we are adding
some additional B. So this is one way
of thinking of it. Another way of thinking of this is the following. You can think of it as aabbbb
and you can think of it as this being the
core or the center or the middle part
of that string. So this is a string that
has more B’s than A’s. A string in which, you know,
the A’s appear before the B’s and there are more B’s than A’s. There are two different ways of
thinking of this string and each of these two different
views will lead to a different grammar. So in this case, I can
construct a grammar for this by saying okay, this
grammar will consist of S1 followed by S2, where S1 generates a string that belongs to the language of equal A's and B's. So S1 is going to be a S1 b or Epsilon. So S1 is a string that belongs to the language of equal A's and B's, and S2 is going to be what?>>B's.>>All b's. All b's.>>So it's going to be S2 b, then S2 –>>Yeah, so it's going to be b S2. So this is the recursion. We can –>>Or.>>Or.>>Or b, Epsilon.>>No.>>No, if you add Epsilon
too, it would be wrong. Why?>>You would have to have at least one b.>>Yeah, because then you can
substitute Epsilon for S2 which means that you can have
equal number of A’s and B’s. So if you substitute Epsilon for
this, you will have equal number of A’s and B’s and that string
does not belong to this language because in order for this string
to belong to this language, it has to have more
B’s than A’s. So Epsilon is wrong.>>Okay.>>But you know, if this had
been greater than or equal to, the Epsilon would
have been correct. Okay, so equality
is not allowed. So this is one view
of this string. So S1 is a string
with equal A’s and B’s and S2 is a string
with a bunch of B’s. Now here, we can look at this
as a string where S is aSb or B; so this aSb gives you the center, the center part of the string, or B, where B is what?>>At least one b.>>Yeah, so it can be like S2. You can call it S2, by the way. You can call it any name, B or C. Okay.
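Here is a small illustrative Python sketch of the first view of this grammar, assuming we encode the productions as a dictionary; every string it produces has a's before b's and strictly more b's:

    import random

    # S -> S1 S2,   S1 -> a S1 b | Epsilon,   S2 -> b S2 | b
    GRAMMAR = {
        "S":  [["S1", "S2"]],
        "S1": [["a", "S1", "b"], []],
        "S2": [["b", "S2"], ["b"]],
    }

    def generate(symbol="S"):
        """Randomly expand variables until only terminals remain."""
        if symbol not in GRAMMAR:                    # terminal symbol
            return symbol
        production = random.choice(GRAMMAR[symbol])
        return "".join(generate(s) for s in production)

    # generate() might return "aabbbb", "abb", or just "b",
    # but never a string with as many a's as b's.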
So let's do a derivation for each one of them to understand what's going on. So let's do a derivation
for this. So let’s derive this string. To derive this string, we will
have S and S consists of S1, S2. Okay. I think this will not fit
so we’ll …okay, okay yeah. Let’s do it here so S is
going to be S1, S2 right? So this is our S1 and this
is our S2, and S1 is going to derive two a's and two b's. So it's a S1 b, a S1 b, and then Epsilon. So now we have derived an equal
number of A’s and B’s. Then we derive for
S2, we can do b S2. I'm sorry, it's b S2. b S2, then what should we do?>>Get another b.>>Another b.>>Another b S2. Another b S2 then, oh sorry. [multiple voices]. So we only need one b, so we
have to substitute one B. So now this gives us equal
number of A's and B's, and this gives us two extra B's. And again, the derived string: you can read the derived string by looking at the leaves from left to right. So this is going to be symbol number 1, number 2, number 3, number 4, number 5, number 6. Look at the leaves of the tree.
on the leaves of the tree, if you read them from
left to right in order. Okay. Let’s do the
derivation here. So the derivation for
this is going to be what? So it's going to be, you know, this is our B, okay? And in the middle we have aSb. So what we will do: aSb. Then we will do aSb. Then the central part is going to be a B, and this B is going to consist of b B, and this is b. Okay? So here this is number 1. This is number 2. This is 1, 2, 3, 4, 5, 6. Okay? Everyone now remembers
context free grammars. By the way, what do we mean
by a context free grammar? Why are these grammars, both of
these grammars are context free? What is a context free grammar?>>It’s because the variable
can appear anywhere within, within the symbol string. [inaudible].>>Okay. No you are defining
a non-linear or non-regular.>>Yeah.>>Yeah. So this is, what
makes a grammar context free as opposed to context sensitive?>>There is only one
derivation per each string. There can’t be more than one.>>One derivation
for each string? No –>>For a given grammar.>>For a given grammar,
there is only a string of derivation for a string?>>Yeah.>>So that’s the
definition of what? You're defining something else. If there exists a single derivation for a string, then the grammar is called?>>Non-redundant?>>Non-ambiguous.>>Non-ambiguous. If there are multiple
derivations, it’s ambiguous. If there is only a unique
derivation, it’s not ambiguous. So you are defining ambiguous
versus non-ambiguous. But what’s the definition
on context free versus context sensitive? Context free grammar, all
the variables that appear on the left hand side
or whatever appears on the left hand side is
an individual variable. It cannot be multiple variables. So you cannot have something
like, you know, aSb is aSbd, or something like this. So on the left hand
side, what you have on the left hand side
is a single variable. That makes it context free. If on the left hand side you
have variables and terminals, then in that case it will
be context sensitive. And context sensitive,
we’re not interested in context sensitive
grammars in this course because programming languages
can be defined using context free grammars. So that’s why we are
limiting ourselves to context free grammars
because that’s what we need to define the syntax of
a programming language. We just need to know what context sensitive means. All we need to know is what it means for a grammar to be context free as opposed to context sensitive. Okay. So everything that we will cover here is going to be a context free grammar, or CFG. Although I hate to write
this in a compiler course, because CFG in a compiler course
is an acronym for something else which is control flow graph. So it’s, I hate to use this
acronym because it’s overloaded. It means context free grammars
and it means control flow graph. Okay. So let's take another example. How about this grammar: S is aSb or bSa or SS or Epsilon. So this is a grammar. How many productions
do we have here?>>Four.>>Four productions. So each one of these that
are connected with the or, each one of them is a
different production. You could have written
each production separately. So this is production 1, production 2, production 3, production 4. And we still have one symbol
or one variable which is S and two terminals, A and
B. Now the question is, what’s the language
that is defined by or that is generated
by this grammar?>>Even length strings?>>Even ->>Palindrome?>>It's, well, it's more restrictive than that. So this is an even length string that does not belong to the language. [Multiple Speakers]>>It can't be almost. So it's either a
palindrome or not. There is no almost.>>Same number of A’s and B’s?>>Yeah, exactly. So it’s the same
number of A’s and B’s. And, by the way, the grammar
for a palindrome is this. So palindrome is ASA or
BSBE or so if you want it to be an even length
palindrome, you put an Epsilon. If you want it to
be an odd length; if you want to accept odd length
palindromes also you can accept A and B. So this is
the palindrome grammar. Okay so for this language,
it’s the set of strings that W such that there is a notation
for number of A’s that you would like to use which is
NA of W equals AB of W and we will define an A here. NA of W is the number of
A’s in W. And of course, NB of W is the number
of B’s in W. Okay. So it’s the language that has
in which the set of strings in which the number of A’s is
equal to the number of B’s. Okay. So now, this then
let’s derive the for example, let’s get the, so this
is the palindrome. So let’s do a derivation
for , for BABBBA. So in order to derive this,
we can think of this as, so this is one S and
this is another S. Okay. So we derive this. This is one derivation. S is SS, two S’s. The left hand S is going
to be AB repeated twice. So it’s going to ASBASD. Then what should I
substitute for this?>>Epsilon.>>Then the left hand, the right
hand I would derive D as A okay. And then I would
substitute Epsilon. Okay? So this is one derivation. In fact, you know I
could have derived this. This is also another
valid derivation. I can put SS, this S I can
substitute an Epsilon then this S is going to be
the same as this. So this, I can put this here. So this means that I have
at least two derivations. When I have at least two
derivations for the same string, then the grammar is ambiguous. This is an ambiguous
grammar because this is at least another derivation
for the same string. I can do this and
then this S is going to have the same
string that I have here. Okay so you have multiple
derivations for the same string, we call this grammar,
ambiguous grammar. And now, oh so let’s, now
let's make a distinction between a left most derivation and a right most derivation. So in a left most derivation,
we make the substitution for the left most variable. In a right most, we make the
substitution for the right most. What does this mean? So in order to, let’s do the
derivation for this again. So for this we do S. We substitute SS for S. When we do the derivation this way, so let's do left most, when we do the derivation this way, we again make one substitution at a time. So now the question is, should we first substitute for this S or this S? If it's a left most derivation, then we substitute for this S. If it's a right most derivation, we substitute for this one first. So this is a left most derivation. When we substitute for the left in this case, substituting for the left S gives aSb and the right S remains the same, so we have aSbS. Then again we substitute for the left most variable, which is this. So we're going to do aSb again, giving aaSbbS. Then we do, what's the next substitution in this case?>>Epsilon.>>Epsilon for this. So it's going to be aa Epsilon bbS. Then the last step is going to be… no, there will be more steps. So we'll start substituting for this S, then we'll have aa Epsilon bb aS, sorry, bSa. Then the next, the last step is aa Epsilon bb b Epsilon a. So this is the left most derivation, in which we substituted for the left most variable whenever we had a choice. This is the sentential form, by the way. This is a sentential form; it's any intermediate step in the derivation that has a mix of terminals and non-terminals. It's a sentential form or an intermediate form. So in the first step we
have the start symbol. At the very end, we
only have terminals and in the middle we have a mix
of terminals and non-terminals. Okay. So let's do right most. Right most derivation. Okay. S goes to SS, then we substitute for the right. We substitute for the right S, so it's going to be S bSa, then this is going to go to S b Epsilon a. So we did this part first. We did the right first. Now we're done with the right most part, so now we substitute for the left S. So we will do aSb b Epsilon a. Then this derives aaSbb b Epsilon a. Then at the end, we substitute for the S and get aa Epsilon bb b Epsilon a. And then you can get rid of the Epsilon because it's the empty string. Okay.
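To see the leftmost versus rightmost distinction in code, here is an illustrative sketch (the encoding is my own assumption, not the course's notation) in which each step replaces the leftmost or rightmost S in the current sentential form:

    def substitute(form, production, leftmost=True):
        """Replace the leftmost (or rightmost) S in form with production."""
        i = form.find("S") if leftmost else form.rfind("S")
        return form[:i] + production + form[i + 1:]

    # Leftmost derivation of "aabbba":
    #   S => SS => aSbS => aaSbbS => aabbS => aabbbSa => aabbba
    form = "S"
    for prod in ["SS", "aSb", "aSb", "", "bSa", ""]:
        form = substitute(form, prod, leftmost=True)
    assert form == "aabbba"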
So now the point is, if a string has, usually a string will have a left most derivation and a right most derivation. Unless the grammar is very simple, there will always be a left most and a right most. So having both the left most and the right most doesn't mean that the grammar is ambiguous. A grammar is ambiguous if you have more than one left most derivation or more than one right most derivation. In other words, you have
more than one derivation tree. If the derivation tree; the derivation tree does
not specify the ordering for the derivation. So this derivation tree is
not left most or right most. You can’t call it left
most or right most because we don’t have
the notion of ordering. The ordering of the
substitution. So this is, this tree, this
tree corresponds to both, the left most and
the right most; so both of these derivations
correspond to this tree. So a grammar is ambiguous
if there are multiple trees. There are multiple
derivation trees. Okay. Now why do we
care about derivation? Because in fact you know
what parsing is all about. It’s about finding a
derivation for a string. So your program is a
string that belongs to that programming language
and the parser is trying to find the derivation for that. Given the rules, the productions
or the rules that we will write to define the programming
language, a parser is that stage in the compilation that looks
at the rules and the program which is just a string of
characters and it tries to find a derivation
for that string. And so your program, even if it's a million-line program, is a string that should belong to, that must belong to, the language defined by the context free grammar for that programming language. And then it will construct a
derivation tree for that string and then that derivation
tree will become the abstract syntax tree. So parsing is automated
derivation. So it is a program that
does the derivation. So we will be studying
some algorithms for doing the derivation. Now we’re doing the
derivation by inspection. We don’t have a specific
algorithm for doing the derivation. We have small grammars
and small strings and we are doing our
derivation by inspection. But this will not work
for a compiler that needs to parse a potentially, a program with a
[inaudible] length. It needs to parse a program that
has millions of lines of code. So we need an algorithm
to do this, to automate the derivation, to
figure out a valid derivation for a string, or to determine that no derivation is possible. And if no derivation is possible, what does that mean?>>Compile error.>>Syntax errors.>>It's a syntax error. If the compiler cannot find a derivation, then that's a syntax error. Okay.
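As a tiny preview of automated derivation, here is an illustrative recursive-descent recognizer for S -> aSb | Epsilon. It is a sketch under simplifying assumptions, not the parsing algorithm we will develop, but it is a program that either finds a derivation or reports a syntax error:

    def parse_S(s, i=0):
        """Return the index just past one S, or raise SyntaxError."""
        if i < len(s) and s[i] == "a":        # try S -> aSb
            i = parse_S(s, i + 1)
            if i < len(s) and s[i] == "b":
                return i + 1
            raise SyntaxError("expected 'b'")
        return i                              # otherwise S -> Epsilon

    def recognize(s):
        try:
            return parse_S(s) == len(s)       # leftover input is also an error
        except SyntaxError:
            return False

    # recognize("aaabbb") -> True,  recognize("aab") -> False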
So this is what we will be doing in the next lecture. Next lecture we will start writing grammars for actual programming language constructs. So today's lecture was about reviewing the concepts of context free grammars and derivations, left most derivation and right most derivation, and parsing is based on these grammars. Next lecture we will start writing grammars for actual programming language constructs, and we'll start with expressions. One of the [inaudible]
today’s lecture? Yeah.>>In the compilers
doing the derivations, what’s the time complexity
on that? Is it like fast or –>>Yeah it’s fast. So it’s going to be,
yeah we will see. It’s very fast compared to
other stages in the compiler.
