>>And parsing will be based

on context free grammars. So in today’s lecture we will

be basically reviewing context free grammars. So the lecture will be about reviewing not introducing

context free grammars. So I assume that you already

know context free grammars and you have seen them in 135 and you have written

many grammars but we are reviewing them now

and then we will learn how to use these grammars

to build a parser, so. So in the first stage

of the compiler, the compiler will do scanning and the scanner will divide the

input string which is a program. By the way, your program will be viewed by the scanner as a single string, no matter how big your program is or how many lines of code you have. It will be viewed as a string. So if you have something

like X234 equals 579 and the scanner divides

this into words or tokens; so this is the, this

will be identifier. This will be a number. By the way, when we talk about parsing, there will be little advantage in doing both integer and floating-point numbers. So we'll just focus on integers. So there will not be any actual

benefit in doing both integers and floating-point

or real numbers. It will be the same concept

from parsing point of view. From scanning point of view

[inaudible] and we know how to write rules for integers and how to write rules for floating points. But from a parsing point of view, a number is a number. So in our treatment of

parsing, we’ll just say a number and we will limit

ourselves to integers but it’s not very limited because floating points

will be treated the same way by the parser. And this is our assignment. And this is our semi

colon or end of statement. Now what’s interesting in the

compiler is that these parts of speech that the

scanner identifies, these will become the

symbols of the alphabet that the parser deals with. So our alphabet at the parsing

level, the alphabet is going to consist of an identifier. So this whole identifier is

just one symbol of the alphabet and this comes from the scanner. A number is another

symbol of the alphabet. You know the equal is another

symbol of the alphabet plus, minus, multiply, if, else. So from parsing point of view, each one of these

would be viewed as a single symbol

of the alphabet. So this is one symbol. This is one symbol

in the alphabet. So there are two levels. There is the scanning level

where we do the micro syntax and we identify the

words of the language and there is the parsing level

and at the parsing level, these words become individual

symbols in that language that we will be trying to recognize using

context free grammars. Of course you know, some

of them will be just individual characters. So this symbol is just

an individual character. But this from parsing point

of view, this will be treated as a single symbol and this

symbol is the keyword else. So the scanner will

return to the parser. Basically, the way

this works is, the parser will call

the scanner. The parser calls the scanner and the scanner will

return a part of speech. So it will return you know,

identifier, number whatever, and it will return also the

actual lexeme or a clearer way of doing this is get next word. So the parser does

get next word. It’s calling the scanner to get

next word in the input stream. So with this example, the first

time this get next word will return an identifier. It will say, okay the

first word is an identifier and this is an actual lexeme. Then when the parser calls

get next word the second time, it will return this. It will tell it the next word

is the assignment operator and the third word is a number. Okay. Yes?>>I have two questions. So the scanner doesn’t

operate on the entire input and then just pass it

as a list to the parser. It actually gets

called for each word.>>Yes.>>Okay.>>Well in principle, you can

do it both ways but in practice, yes, it will return

one word at a time.>>Okay.>>The scanner returns

one word at a time.>>Okay. The second question is, when the scanner encounters

something like a number, does it do the conversion into

an actual number format or is that the job of the parser?>>Okay so all that the scanner

does is that it identifies this as a syntactic category. It says the syntactic

category number and then it will give the parser

the actual scheme, the lexeme, I’m sorry, which is 579, okay. So this is the lexeme and then

the parser deals with this. In fact, doing the actual conversion is neither scanning nor parsing. Scanning is about identifying

words and parsing is about identifying

sentences or statements. But yeah, once this has

been identified as a number, we can do the conversion. It can be done as a, you know,

it can be done in scanning or in parsing but in

principle, it does not belong to either scanning or

parsing, conceptually. Okay. So now this is
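The interface just described, where the parser repeatedly calls get next word and the scanner hands back one word as a syntactic category plus the actual lexeme, can be sketched roughly as follows. The token patterns here are illustrative assumptions, not an actual scanner specification:

```python
import re

# Hypothetical token patterns; a real scanner would have many more
# categories (keywords, operators, and so on).
TOKEN_SPEC = [
    ("number",     r"[0-9]+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("assign",     r"="),
    ("semicolon",  r";"),
    ("skip",       r"\s+"),          # whitespace: matched but not returned
]

def get_next_word(program):
    """Yield one word at a time as a (syntactic category, lexeme) pair."""
    pos = 0
    while pos < len(program):
        for category, pattern in TOKEN_SPEC:
            m = re.match(pattern, program[pos:])
            if m:
                pos += m.end()
                if category != "skip":
                    yield (category, m.group())
                break
        else:
            raise SyntaxError(f"unexpected character {program[pos]!r}")
```

Running this on x234 = 579; yields identifier/x234, assign/=, number/579, semicolon/;, one pair per call, just as in the example above.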

going to be our alphabet but in today’s lecture, we will

focus on a simple alphabet. The basic alphabet that you

have probably seen in 135, which is the alphabet AB. The alphabet that consists

of two characters AB only to review the concept of

context free grammars. So in today’s lecture we’ll

focus on this alphabet but in subsequent lectures, this

is what we will be focusing, this is what our

alphabet will be. Okay. So you know, a

context free grammar, how would you describe

a grammar? Since this is a review, since

we are reviewing grammars, how would you describe

it or define it? Yeah.>>It’s supposed to have

rules of the language.>>Yeah. It’s a set of rules. Basically a grammar is a set of

rules for defining a language that recognize or that

generates a certain language and the language,

a formal language, when we say language

here in this context, it’s just a set of strings. Formal language is

a set of strings. So it’s a set of rules that

generate all the strings that belong to a given language. A set of rules that

generate all the strings that belong to a given language. Now that language and again, I’m

talking about formal languages, if that language is regular

then we don’t need grammars. We can recognize that language

using a finite automaton or a regular expression. In fact, we can still use grammars, because grammars are more general, and the special type of grammars that generate regular languages, we call them, what do we call the grammars that generate regular languages?>>Regular grammars.>>Regular grammars. Exactly. So there is a subset of these grammars, a subset of context free grammars. In fact, context free grammars are themselves a subset of the bigger set of grammars, the context sensitive grammars. But a subset of context

free grammars is the set of regular grammars that

recognize regular languages. Now here we focus on those

languages that are non-regular. That cannot be recognized

by finite automata and regular expressions. And maybe the classical example of a non-regular

language is this language. A set of strings W such that or

in fact I'll write it this way: the language a^n b^n, such that n is greater than or equal to 0. Of course, like I said, in today's lecture we'll focus on this alphabet, a and b. So this is an example of

a non-regular language. It’s a language that cannot

be recognized by automata and regular expressions but can

be generated using a context free grammar. And the context free

grammar for this language is S, so does anyone remember this? This is probably the first example that you have seen of context free grammars. S is aSb or –>>Empty.>>– or Epsilon. So we do need the Epsilon. So this is a context free grammar. Now how many rules or productions does this grammar have?>>One.>>Two. It has two. So this is one production. In fact, you know, these

are two productions that are merged together. We could have written

this as, you know, S is aSb or S is Epsilon. So we could have written this as two separate rules

or productions. So each one of these is called a rule or a production, but usually it's more convenient to write it this way. If there are multiple rules

that define the same symbol, we can do them this way. Where this bar stands for what?>>Or.>>Or. So this stands for or. So it's like S is either aSb or Epsilon. Okay? We can write it this way

or we can write it this way. Okay. Now in this case,

you know a grammar, what do we call the S?>>Non terminal?>>Okay, it’s the non-terminal

but it’s a special non-terminal.>>The start.>>The start. Yeah. So the grammar has, I think we can, have

to do this now. So a grammar consists of V, T, P, and S. So it's a quadruple that consists of four components. The V is the set of variables. Set of variables, or

there’s another name for variables in this context. Variables are non-terminals. So variables are non-terminals. Non-terminals and

T is a set of –>>Terminals.>>– terminals. P –>>The production.>>– set of productions.>>Yeah. And S, there is

always one distinct variable.>>The start variable.>>The start variable. So in this case,

in this example, we only have one variable which

is S. We have two terminals A and B and S, the only

variable that we have happen to be the start variable. And we have two productions

that are written in this way. Okay? So this is an

example of a grammar. We’ll see more grammar
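For the example just given, the quadruple can be written out concretely; a minimal sketch in plain Python data, with the empty string standing in for Epsilon:

```python
# The grammar (V, T, P, S) from the example, written as Python data.
V = {"S"}                  # variables (non-terminals)
T = {"a", "b"}             # terminals
P = {"S": ["aSb", ""]}     # the two productions: S -> aSb | Epsilon
S = "S"                    # the start variable (the only variable here)
```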

examples in this lecture. Now, the set of terminals, you can also view it as the alphabet for the grammar. It's the set of symbols

that can appear in a string that belongs to this language. Or a string that

derives from this grammar or generated by this grammar. Now in the context of

programming languages, our sigma, like I said, will not be a and b. Sigma will be identifier, number, the different keywords, the different operators. Okay, so now let's look at this. Let's do what we

call derivations. So let's do a derivation for aaabbb. So how do we derive it? We start with the start

symbol or start variable. Then in each step, we

make a substitution. So in order to derive this

string from this grammar, the first substitution is

going to be aSb and of course, you know, you cannot substitute for terminals. By the way, here I'm following the convention of writing terminals in lowercase and variables in uppercase, so this is the convention

that I'm using. Terminals are in lowercase. Variables are in uppercase. So in our convention, the terminals here are lowercase and the variables are uppercase. There are other conventions. You know, some

people underline terminals. So there is another

convention for writing grammars where people underline

terminals. So in this case we substitute aSb for S, so at any step you can substitute for S either this or this. There are two possible

substitutions because you have two

possible rules or productions. And for this S we also substitute aSb, and we do it again, aSb, and in the final step we substitute Epsilon. So the string that we have just derived is this string: aaabbb. So the string

that gets derived consists of the leaves of the tree. So this is what we

call derivation tree. This is the derivation

tree and the string that gets derived would

have been on the leaves of this derivation tree

and we read it in order, left to right. So this is basically going to be, you know, this is the first

symbol in the string. This is the second, third,

fourth, fifth, sixth. Okay? So this is the string

that we have just derived. So this is a derivation tree and

remember, you know in compilers, our goal is to read a

program written in a source, some high level language and construct what you

call the abstract syntax. Now the abstract syntax

tree is going to come from this derivation tree or

parse tree and the difference between the abstract syntax

tree and the parse tree is that in the abstract syntax

tree, we only keep the nodes that correspond to actual

operations and we get rid of the nodes that correspond

to grammatical symbols. You know we don’t care about

symbols in the grammar. We care about nodes

that correspond to actual operations you

know, such as add or multiply; operations that will get

translated into actual code because we are building

the abstract syntax tree in order to generate code. We will get to that. So you don't have to fully understand the difference between the derivation tree and the abstract syntax tree, in fact, until we are done with parsing. And parsing, like I said,

we’ll take three weeks. So this week we’ll basically

learn how to write grammars for actual programming

language constructs and the second week we will look

into one approach to parsing, which is top down parsing, and the third week we will look at bottom-up parsing. So there are two approaches to parsing, top down and bottom-up. So each one of them will

take probably a week. Okay. So this is an example. By the way, you can also do the

derivation in a textual format so you can do the

derivation like this: S, then you substitute aSb for S, then you substitute another aSb for this S, so this also derives aaSbb. So what you did here is that you substituted for this S based on this rule, because I have a rule here that says I can substitute aSb for S. Then there is a third step where we do aaaSbbb, and again I have substituted for this S. And the last step is substituting an Epsilon for the S. So I would end up with aaabbb. Usually we don't write the Epsilon. You can write the Epsilon, but usually we don't write it. And this leads to the string that we are trying to derive. Okay, so let's look at
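This substitution process is mechanical enough to script. A small sketch of our own that applies S to aSb a fixed number of times and then substitutes Epsilon, listing every step:

```python
def textual_derivation(n):
    """List the sentential forms in deriving a^n b^n from S."""
    form = "S"
    forms = [form]
    for _ in range(n):
        form = form.replace("S", "aSb", 1)   # apply S -> aSb
        forms.append(form)
    form = form.replace("S", "", 1)          # apply S -> Epsilon
    forms.append(form)
    return forms
```

textual_derivation(3) lists S, aSb, aaSbb, aaaSbbb, aaabbb, the same steps as the derivation above.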

another example now. So, how about a^n b^m, with n greater than or equal to 0 and m greater than n? Now we need another grammar for this language. Okay. So a grammar for this language, the language of those A's and B's, A's followed by B's, where the number of B's is greater than the number of A's. In fact, there are two

ways of thinking of this. So you can think of it this way: aabbbb. So you can think of it as, this is a string that belongs to the language of equal A's and B's, and then we are adding some additional B's. So this is one way

of thinking of it. Another way of thinking

this is the following. You can think of it as aabbbb, and you can think of this being the core or the center or the middle part of that string. So this is a string that

has more B’s than A’s. A string in which, you know,

the A’s appear before the B’s and there are more B’s than A’s. There are two different ways of

thinking of this string and each of these two different

views will lead to a different grammar. So in this case, I can

construct a grammar for this by saying okay, this

grammar will consist of S1 followed by S2, where S1 generates a string that belongs to the language of equal A's and B's. So S1 is going to be aS1b or Epsilon. So S1 is a string that belongs to the language of equal A's and B's, and S2 is going to be what?>>B's.>>All B's. All B's.>>So it's going to be S2 b, then S2 –>>Yeah, so it's going to be bS2. So this is the recursion. We can –>>Or.>>Or.>>Or b, Epsilon?>>No.>>No, if you add Epsilon

too, it would be wrong. Why?>>You would have only used S1.>>Yeah, because then you can

substitute Epsilon for S2 which means that you can have

equal number of A’s and B’s. So if you substitute Epsilon for

this, you will have equal number of A’s and B’s and that string

does not belong to this language because in order for this string

to belong to this language, it has to have more

B’s than A’s. So Epsilon is wrong.>>Okay.>>But you know, if this had

been greater than or equal to, the Epsilon would

have been correct. Okay, so equality

is not allowed. So this is one view

of this string. So S1 is a string

with equal A’s and B’s and S2 is a string

with a bunch of B’s. Now here, we can look at this

as a string that has S is aSb, so this gives you the center part of the string, or B, where B is what?>>At least one b.>>Yeah, so it can be like S2. You can call it S2 by the way. You can call it any name, B or C. So, okay. So let's do a derivation

for each one of them to understand what’s going on. So let’s do a derivation

for this. So let’s derive this string. To derive this string, we will

have S and S consists of S1, S2. Okay. I think this will not fit

so we’ll …okay, okay yeah. Let’s do it here so S is

going to be S1, S2 right? So this is our S1 and this

is our S2, and S1 is going to derive two A's and two B's. So it's aS1b, aS1b, and then Epsilon. So now we have derived the equal number of A's and B's. Then we derive for S2: bS2, then what should we do?>>Get another B.>>Another B.>>Another bS2? Then, oh sorry. [multiple voices]. So we only need one B, so we

have to substitute one B. So now this gives us equal

number of A's and B's and this gives us two extra B's. And again, the derived string, you can read the derived string by looking at the leaves from left to right. So this is going to be symbol number 1, number 2, number 3, number 4, number 5, number 6. Look at the leaves of the tree. So the terminals that appear

on the leaves of the tree, if you read them from

left to right in order. Okay. Let’s do the

derivation here. So the derivation for

this is going to be what? So it’s going to be, you

know, this is our B. Okay? And in the middle we have aSb. So what we will do: aSb. Then we will do aSb. Then the central part is going to be a B, and this B is going to consist of bB, and this is just b. Okay? So here this is number 1. This is number 2. This is 1, 2, 3, 4, 5, 6. Okay? Everyone now remembers
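Both views describe the same language, so a direct membership check makes a useful sanity test. A sketch, with a helper name of our own:

```python
import re

def more_bs_than_as(w):
    """True iff w is a's followed by b's, with strictly more b's than a's."""
    m = re.fullmatch(r"(a*)(b*)", w)
    return m is not None and len(m.group(2)) > len(m.group(1))
```

It accepts aabbbb and rejects aabb, since equality is not allowed.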

context free grammars. By the way, what do we mean

by a context free grammar? Why are these grammars, both of

these grammars are context free? What is a context free grammar?>>It’s because the variable

can appear anywhere within, within the symbol string. [inaudible].>>Okay. No you are defining

a non-linear or non-regular.>>Yeah.>>Yeah. So this is, what

makes a grammar context free as opposed to context sensitive?>>There is only one

derivation per each string. There can’t be more than one.>>One derivation

for each string? No –>>For a given grammar.>>For a given grammar,

there is only a single derivation for a string?>>Yeah.>>So that's the

definition of what? You’re defining something else. There exists a single derivation for a string then the

grammar is called?>>Non-redundant?>>Non-ambiguous.>>Non-ambiguous. If there are multiple derivations, it's ambiguous. If there is only a unique

derivation, it’s not ambiguous. So you are defining ambiguous

versus non-ambiguous. But what’s the definition

on context free versus context sensitive? In a context free grammar, whatever appears on the left hand side of a production is an individual variable. It cannot be multiple variables. So you cannot have something

like, you know, aSb goes to aSbd or something like this. So what you have on the left hand side is a single variable. That makes it context free. If on the left hand side you

have variables and terminals, then in that case it will

be context sensitive. And context sensitive,

we’re not interested in context sensitive

grammars in this course because programming languages

can be defined using context free grammars. So that’s why we are

limiting ourselves to context free grammars

because that’s what we need to define the syntax of

a programming language. We don't really need to know context sensitive grammars. All that we need to know is what it means for a grammar to be context free, as opposed to context sensitive. Okay. So everything that we will cover here is going to be a context free grammar, or CFG. Although I hate to write

this in a compiler course, because CFG in a compiler course

is an acronym for something else which is control flow graph. So it’s, I hate to use this

acronym because it’s overloaded. It means context free grammars

and it means control flow graph. Okay. So let’s take

another example. How about this grammar: S is aSb or bSa or SS or Epsilon. So this is a grammar. How many productions

do we have here?>>Four.>>Four productions. So each one of these that

are connected with the or, each one of them is a

different production. You could have written

each production separately. So this is production 1, production 2, production 3, production 4. And we still have one symbol

or one variable which is S and two terminals, A and

B. Now the question is, what’s the language

that is defined by or that is generated

by this grammar?>>Even length strings?>>Even ->>Palindrome?>>Well, it's more restrictive. So this is an even length string that does not belong to the language. [Multiple Speakers]>>It can't be almost. So it's either a

palindrome or not. There is no almost.>>Same number of A’s and B’s?>>Yeah, exactly. So it’s the same

number of A’s and B’s. And, by the way, the grammar

for a palindrome is this. So palindrome is aSa or bSb, or, so if you want it to be an even length palindrome, you put an Epsilon. If you want to accept odd length palindromes also, you can add S is a and S is b. So this is

the palindrome grammar. Okay so for this language,

it's the set of strings W such that, there is a notation for the number of A's that you would like to use, NA of W equals NB of W. And we will define it here: NA of W is the number of A's in W, and of course NB of W is the number of B's in W. So it's the language, the set of strings, in which the number of A's is equal to the number of B's. Okay.
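One way to convince yourself that this grammar generates exactly that language is to brute-force it for short strings. A sketch of our own, with an ad hoc bound to keep the search finite:

```python
RULES = ["aSb", "bSa", "SS", ""]   # S -> aSb | bSa | SS | Epsilon

def generated(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    frontier, seen, out = {"S"}, {"S"}, set()
    while frontier:
        form = frontier.pop()
        if "S" not in form:
            out.add(form)              # no variables left: a derived string
            continue
        for rhs in RULES:
            new = form.replace("S", rhs, 1)   # leftmost substitution
            # prune sentential forms that cannot yield a short string
            if sum(c != "S" for c in new) > max_len or new.count("S") > max_len:
                continue
            if new not in seen:
                seen.add(new)
                frontier.add(new)
    return out
```

Comparing generated(4) against every string over a, b of length up to 4 with equal counts gives exactly the same set.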

So now, this is the palindrome. Let's do a derivation for aabbba. So in order to derive this,

we can think of this as, so this is one S and

this is another S. Okay. So we derive this. This is one derivation. S is SS, two S's. The left hand S is going to be aSb applied twice. So it's going to be aSb, then aSb again. Then what should I substitute for this?>>Epsilon.>>Then the right hand S I would derive as bSa, okay. And then I would substitute Epsilon. Okay? So this is one derivation. In fact, you know, I

could have derived this. This is also another

valid derivation. I can put SS, this S I can

substitute an Epsilon then this S is going to be

the same as this. So this, I can put this here. So this means that I have

at least two derivations. When I have at least two

derivations for the same string, then the grammar is ambiguous. This is an ambiguous

grammar because this is at least another derivation

for the same string. I can do this and

then this S is going to have the same

string that I have here. Okay so you have multiple

derivations for the same string, we call this grammar an ambiguous grammar. And now
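The two derivations can be replayed mechanically. A small sketch with a helper of our own that rewrites the leftmost S at each step; both rule sequences below use only the productions of this grammar:

```python
def leftmost_derive(steps, start="S"):
    """Apply each right-hand side to the leftmost S in turn."""
    form = start
    for rhs in steps:
        form = form.replace("S", rhs, 1)
    return form

# Two different derivations of the same string from S -> aSb | bSa | SS | Epsilon.
# The second one first sends the left S of SS to Epsilon, as described above.
d1 = leftmost_derive(["SS", "aSb", "aSb", "", "bSa", ""])
d2 = leftmost_derive(["SS", "", "SS", "aSb", "aSb", "", "bSa", ""])
```

Both sequences derive the string aabbba, which is what makes this grammar ambiguous.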

let's make a distinction between a left most derivation and a right most derivation. So in a left most derivation,

we make the substitution for the left most variable. In a right most, we make the

substitution for the right most. What does this mean? So in order to, let’s do the

derivation for this again. So for this we do S. We substitute SS for S. When we do the derivation this

way, so let’s do left most. When we do the derivation

this way, we again, we make one substitution

at a time. So now the question is, should we first substitute

for this S or this S? If it’s a left most

derivation, then we substitute for this S. It’s a

right most derivation, we substitute for this first. So this is a left

most derivation. When we substitute for the

left in this case, substituting for the left gives aSbS, and the right S remains the same. Then again we substitute for the left most variable, which is this. So we're going to do again: aaSbbS. Then we do, what's the next substitution in this case?>>Epsilon.>>Epsilon for this. So it's going to be aa Epsilon bbS. Then, no, there will be more steps. So we'll start substituting for this S, then we'll have aa Epsilon bb bSa. Then the next, or the last, step is aa Epsilon bbb Epsilon a. So this is the left most derivation, in which we substituted for the left most variable whenever we had one. This is the sentential

form, by the way. It's any intermediate step in the derivation that has a mix of terminals and non-terminals. It's a sentential form, or an intermediate form. So in the first step, we

have the start symbol. At the very end, we

only have terminals and in the middle we have a mix

of terminals and non-terminals. Okay. So let’s do right most. Right most derivation. Okay. SS then we

substitute for the right. Substituting for the right is going to give S bSa, then this is going to go to S b Epsilon a. So we did this part first. We did the right first. Now we're done with the right most part, so now we substitute for the S. So we will do aSb b Epsilon a. Then this derives aaSbb b Epsilon a. Then at the end we substitute Epsilon, giving aa Epsilon bbb Epsilon a. And then you can get rid of the Epsilon, because it's the empty string. Okay. So now the point
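The two orders can be contrasted in code. A small sketch with helpers of our own, one rewriting the leftmost S and one the rightmost:

```python
def leftmost_step(form, rhs):
    """Rewrite the leftmost S with rhs."""
    return form.replace("S", rhs, 1)

def rightmost_step(form, rhs):
    """Rewrite the rightmost S with rhs."""
    head, _, tail = form.rpartition("S")
    return head + rhs + tail

def derive(step, rule_sequence):
    """Run a whole derivation from S using the given substitution order."""
    form = "S"
    for rhs in rule_sequence:
        form = step(form, rhs)
    return form

# The same string by a leftmost and by a rightmost derivation,
# using only productions of S -> aSb | bSa | SS | Epsilon.
left  = derive(leftmost_step,  ["SS", "aSb", "aSb", "", "bSa", ""])
right = derive(rightmost_step, ["SS", "bSa", "", "aSb", "aSb", ""])
```

Both produce aabbba; having one leftmost and one rightmost derivation is normal and by itself says nothing about ambiguity.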

is, if a string has, usually a string will have

a left most derivation and a right most derivation. Unless the grammar is very

simple there will always be a left most and a right most. So having both the left most

and the right most doesn't mean that the grammar is ambiguous. A grammar is ambiguous if you have more than one left most derivation, or more than one right most derivation. In other words, you have more than one derivation tree. The derivation tree does

not specify the ordering for the derivation. So this derivation tree is

not left most or right most. You can’t call it left

most or right most because we don’t have

the notion of ordering. The ordering of the

substitution. So this is, this tree, this

tree corresponds to both, the left most and

the right most; so both of these derivations

correspond to this tree. So a grammar is ambiguous

if there are multiple trees. There are multiple

derivation trees. Okay. Now why do we

care about derivation? Because in fact you know

what parsing is all about. It’s about finding a

derivation for a string. So your program is a

string that belongs to that programming language

and the parser is trying to find the derivation for that. Given the rules, the productions

or the rules that we will write to define the programming

language, a parser is that stage in the compilation that looks

at the rules and the program which is just a string of

characters and it tries to find a derivation

for that string. And so your program, even if

it’s a medium line program, is a string that should belong

to that, that must belong to that language to find

why the context free grammar for that programming language. And then it will construct a

derivation tree for that string and then that derivation

tree will become the abstract syntax tree. So parsing is automated

derivation. So it is a program that

does the derivation. So we will be studying

some algorithms for doing the derivation. Now we’re doing the

derivation by inspection. We don’t have a specific

algorithm for doing the derivation. We have small grammars

and small strings and we are doing our

derivation by inspection. But this will not work

for a compiler that needs to parse a program with a potentially arbitrary length. It needs to parse a program that

has millions of lines of code. So we need an algorithm

to do this, to automate the derivation, to

figure out a valid derivation for a string or to determine

that no derivation is possible and if no derivation is

possible, what does that mean?>>Compile error.>>Syntax errors.>>It's a syntax error. If the compiler cannot find a derivation, then that's a syntax error. Okay. So this is what we will
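As a small preview of top-down parsing, a derivation finder for the a^n b^n grammar can be written as a recursive-descent recognizer that raises a syntax error when no derivation exists. This is a toy of our own, not the course's parser:

```python
def match_S(s, i=0):
    """Try to match S -> aSb | Epsilon starting at position i.

    Returns the position just past the match, or None on failure."""
    if i < len(s) and s[i] == "a":           # choose S -> aSb
        j = match_S(s, i + 1)
        if j is not None and j < len(s) and s[j] == "b":
            return j + 1
        return None
    return i                                  # choose S -> Epsilon

def parse(s):
    """Succeed quietly if s is in a^n b^n, otherwise report a syntax error."""
    if match_S(s) != len(s):
        raise SyntaxError(f"no derivation for {s!r}")
```

parse("aaabbb") and parse("") succeed; parse("aab") raises a SyntaxError, which mirrors what the compiler does when no derivation is possible.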

be doing in the next lecture. Next lecture we will start writing grammars for actual programming language constructs. So today's lecture was about reviewing the concepts of context free grammars and derivations, left most derivation and right most derivation, and how grammars relate to parsers. Next lecture we will start writing grammars for actual language constructs, and we'll start with expressions. [inaudible] of constructs is an expression. Any questions on

today’s lecture? Yeah.>>In the compilers

doing the derivations, what’s the time complexity

on that? Is it like fast or –>>Yeah it’s fast. So it’s going to be,

yeah, we will see. It's very fast compared to other stages in the compiler.